test(#1346): canary for engine fix #362 reaching fused-Adam path + Skip'd regression for Tensors#396 #1386
…ip'd regression for Tensors#396

Adds a focused integration test file with two tests:

1. FlashAttentionLayer_TrainViaFusedCompiledAdam_EngagesFusedPath (active)
   Locks in that AiDotNet.Tensors PR #362 (engine-side Engine.FlashAttention GraphMode.IsActive recording branch, shipped in NuGet 0.81.3) actually reaches the AiDotNet fused-Adam training path. A Transformer<float> with FlashAttentionLayer in its layer stack must engage CompiledTapeTrainingStep<float>.TryStepWithFusedOptimizer on the first Train() call. A future regression that breaks GraphMode trace through FA (or any future compile-side gate that rejects FA-containing graphs) would flip this red.

2. DenseIdentity_CCE_OnFusedAdam_DoesNotSilentlyZeroNaN (Skip'd)
   Future-fix regression for the consumer-side gap the #1346 investigation surfaced, now tracked at ooples/AiDotNet.Tensors#396. When a model with multiple trainable parameters routes raw logits (IdentityActivation Dense) through CategoricalCrossEntropyLoss on the fused-Adam path, the CCE chain produces NaN that the Tensors loss readout silently zeroes: the consumer sees GetLastLoss=0 ('converged') while gradients are corrupted. Unskip once the Tensors#396 fix ships and the consuming NuGet bumps.

Closes #1346: the engine-side fix is in master via #362; the remaining consumer-side fix lives in AiDotNet.Tensors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
No actionable comments were generated in the recent review.

Walkthrough
Adds a new integration test file that verifies FlashAttentionLayer engages the compiled fused-Adam path, and includes a skipped regression test ensuring the last-loss readout is not a literal zero that masks a silent failure.

Changes: FlashAttention Fused-Compiled Training Regression Tests

Estimated code review effort: 3 (Moderate), ~20 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/AiDotNet.Tests/IntegrationTests/NeuralNetworks/FlashAttentionFusedCompiledTrainingIssue1346Tests.cs`:
- Around line 37-38: Annotate the test class
FlashAttentionFusedCompiledTrainingIssue1346Tests with an xUnit collection
attribute and create a matching CollectionDefinition with DisableParallelization
= true to ensure the global CompiledTapeTrainingStep<float> fused-step counter
is not shared between parallel test threads; specifically, add
[Collection("FlashAttentionFusedTests")] to
FlashAttentionFusedCompiledTrainingIssue1346Tests and add a
CollectionDefinition("FlashAttentionFusedTests", DisableParallelization = true)
class to the test project so the tests run in isolation and the fused-step
counter reads/resets won't interleave.
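A minimal sketch of what the inline comment above asks for, assuming xUnit v2 and the collection name given in the comment. The marker class name `FlashAttentionFusedTestsCollection` is hypothetical, and the test class body is elided; the collection actually adopted in the PR may use a different name.

```csharp
using Xunit;

// Collection definition. The string must match the [Collection] attribute below.
// DisableParallelization = true keeps this collection from running in parallel
// with other test collections, so reads and resets of a process-wide counter
// cannot interleave with tests from other classes.
[CollectionDefinition("FlashAttentionFusedTests", DisableParallelization = true)]
public sealed class FlashAttentionFusedTestsCollection
{
    // Intentionally empty: xUnit only needs this class as a marker.
    // (The class name is an illustrative assumption.)
}

// The test class opts in to the collection by name.
[Collection("FlashAttentionFusedTests")]
public class FlashAttentionFusedCompiledTrainingIssue1346Tests
{
    // Existing [Fact] methods stay unchanged.
}
```

Tests inside a single xUnit collection already run sequentially; the `DisableParallelization` flag is what additionally isolates the collection from the rest of the test assembly.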
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 05d27118-9dac-4662-9383-f12e5e67f430
📒 Files selected for processing (1)
tests/AiDotNet.Tests/IntegrationTests/NeuralNetworks/FlashAttentionFusedCompiledTrainingIssue1346Tests.cs
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds integration test coverage to lock in behavior around AiDotNet#1346 and the follow-up AiDotNet.Tensors#396, specifically focusing on the compiled fused-Adam training path and loss readout behavior.
Changes:
- Add a canary integration test ensuring a Transformer with `FlashAttentionLayer` engages the fused compiled-Adam training step.
- Add a skipped regression test documenting and guarding against the fused-plan loss readout silently returning `0` for NaN/Inf cases (blocked on Tensors#396).
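The silent-zero failure mode that the skipped test guards against can be illustrated in isolation. The sketch below is not the AiDotNet.Tensors implementation: `CrossEntropyTerm` and `ZeroingReadout` are hypothetical stand-ins, and the per-element formula `-target * log(prediction + eps)` is an assumption based on the PR's description of a `log(negative_logit + eps)` chain.

```csharp
using System;

public static class SilentZeroSketch
{
    // Assumed per-element cross-entropy term: -target * log(prediction + eps).
    // With raw logits (no softmax), prediction can be negative, so the log
    // argument can fall below zero and MathF.Log returns NaN.
    public static float CrossEntropyTerm(float target, float prediction, float eps = 1e-7f)
        => -target * MathF.Log(prediction + eps);

    // Hypothetical readout that replaces non-finite losses with 0.
    // This is the behavior the regression test is meant to reject.
    public static float ZeroingReadout(float loss)
        => float.IsNaN(loss) || float.IsInfinity(loss) ? 0f : loss;

    // A consumer-side guard: a reported loss should be trusted only if the
    // raw value behind it was a finite number.
    public static bool IsTrustworthy(float rawLoss)
        => !float.IsNaN(rawLoss) && !float.IsInfinity(rawLoss);
}
```

Calling `CrossEntropyTerm(1f, -2.5f)` yields NaN, and passing that through `ZeroingReadout` yields `0`, which a training loop cannot distinguish from a perfectly converged loss. That ambiguity is why the regression test checks that the reported loss is not a literal zero.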
PR #1386 review C8Cn3 thread on Task.Yield/async: keeping the
…sedOptimizerGlobalState collection
Summary
Adds focused integration test coverage tied to AiDotNet#1346 (now closed) and the follow-up tracked at AiDotNet.Tensors#396.
Two tests in one new file:
- `FlashAttentionLayer_TrainViaFusedCompiledAdam_EngagesFusedPath` (active, passes): canary that locks in AiDotNet.Tensors PR #362's reach into the AiDotNet fused-Adam training path. A `Transformer<float>` with `FlashAttentionLayer` in its explicit layer stack must engage `CompiledTapeTrainingStep<float>.TryStepWithFusedOptimizer` on the first `Train()` call. Pre-fix, this canary would be invisible (the fused step ran but FA gradients were silently zeroed). Post-fix, the engagement counter is the proxy for "the engine fix reached this path".
- `DenseIdentity_CCE_OnFusedAdam_DoesNotSilentlyZeroNaN` (Skip'd until Tensors#396 ships): future-fix regression for the consumer-side gap surfaced during the #1346 investigation ("FlashAttentionLayer produces degenerate output + 3.76x slowdown vs documented 2-4x speedup"). When a multi-parameter model routes raw logits (IdentityActivation Dense) through `CategoricalCrossEntropyLoss` on the fused-Adam path, the CCE `log(negative_logit + eps)` chain produces NaN that the Tensors loss readout silently zeroes: `GetLastLoss()` returns literal `0.0`, the consumer thinks training converged, and the model never moves off random init. The Skip message includes the unblock condition (Tensors#396 fix + NuGet bump).

Why this scope
The bug investigation for #1346 found that the original engine-side gradient drop was correctly fixed by AiDotNet.Tensors PR #362 (in NuGet 0.81.3). The remaining "consumer still sees degenerate output" symptom turned out to be a much larger bug in the Tensors loss readout that is not specific to FA. It was diagnosed via 5 controlled bisection experiments and filed in detail at AiDotNet.Tensors#396. That fix lives in `AiDotNet.Tensors/Engines/Compilation/CompiledTrainingPlan.cs`, not in AiDotNet. AiDotNet#1346 is closed.

Test plan
- `dotnet build tests/AiDotNet.Tests/AiDotNetTests.csproj --framework net10.0 -c Release`: 0 errors
- `dotnet test ... --filter "FullyQualifiedName~FlashAttentionFusedCompiledTrainingIssue1346Tests" --no-build`: 1 passed, 1 skipped

Local run output:
🤖 Generated with Claude Code