
[US-BF-025] Add InitializeRandomSolution method to OptimizerBase#196

Merged
ooples merged 8 commits into merge-dev2-to-master from
fix/us-bf-025-initialize-random-solution
Oct 23, 2025

Conversation

@ooples ooples commented Oct 23, 2025

Summary

Fixes 272 build errors by adding the missing InitializeRandomSolution method to OptimizerBase.cs.

Changes

  • File Modified: src/Optimizers/OptimizerBase.cs
  • Build Impact: Eliminates 272 CS0103 errors for InitializeRandomSolution

Implementation Details

  • Added protected virtual method InitializeRandomSolution(Vector<T> lowerBounds, Vector<T> upperBounds)
  • Method generates random solutions within specified bounds using NumOps for type-safe operations
  • Uses existing Random field from OptimizerBase
  • Virtual modifier allows subclass overrides if needed

Method Signature

protected virtual Vector<T> InitializeRandomSolution(Vector<T> lowerBounds, Vector<T> upperBounds)
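A self-contained sketch of the behavior this signature implies — note this is hypothetical illustration, not the actual OptimizerBase.cs implementation: the real method operates on the library's `Vector<T>` and `NumOps` abstractions and the shared `Random` field, while this model uses plain `double[]` and a local `Random` for clarity:

```csharp
using System;

// Hypothetical model of InitializeRandomSolution: validates bounds, then
// samples each component uniformly in [lower[i], upper[i]].
static class RandomSolutionSketch
{
    static readonly Random Rng = new Random(42); // stands in for OptimizerBase's Random field

    public static double[] InitializeRandomSolution(double[] lowerBounds, double[] upperBounds)
    {
        if (lowerBounds is null) throw new ArgumentNullException(nameof(lowerBounds));
        if (upperBounds is null) throw new ArgumentNullException(nameof(upperBounds));
        if (lowerBounds.Length != upperBounds.Length)
            throw new ArgumentException("Bounds vectors must have the same length.");

        var solution = new double[lowerBounds.Length];
        for (int i = 0; i < solution.Length; i++)
        {
            if (lowerBounds[i] > upperBounds[i])
                throw new ArgumentException($"Lower bound at index {i} exceeds upper bound.");
            // Uniform sample in [lower, upper]: lower + u * (upper - lower)
            solution[i] = lowerBounds[i] + Rng.NextDouble() * (upperBounds[i] - lowerBounds[i]);
        }
        return solution;
    }
}
```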

Verification

  • ✅ Build verification: 0 InitializeRandomSolution errors after changes
  • ✅ All 30+ optimizer subclasses now compile successfully
  • ✅ Method properly validates input parameters

Affected Optimizers

This fixes build errors in 30+ optimizer files including:

  • AdaDeltaOptimizer, AdagradOptimizer, AdaMaxOptimizer, AdamOptimizer
  • AntColonyOptimizer, NelderMeadOptimizer, SimulatedAnnealingOptimizer
  • And many more population-based and metaheuristic optimizers

Related

  • User Story: US-BF-025
  • Commit: bdb35e2
  • Branch: fix/us-bf-025-initialize-random-solution

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings October 23, 2025 15:13

Copilot AI left a comment


Pull Request Overview

This PR adds the missing InitializeRandomSolution method to the OptimizerBase class to resolve 272 build errors where optimizer subclasses were calling this non-existent method.

Key Changes:

  • Added a protected virtual method InitializeRandomSolution that generates random solution vectors within specified bounds
  • Method uses type-safe numeric operations via NumOps for generic type compatibility
  • Includes parameter validation for null checks and length matching


Comment thread src/Optimizers/OptimizerBase.cs
ooples and others added 2 commits October 23, 2025 14:20
Add validation to ensure lowerBounds[i] <= upperBounds[i] for all dimensions
before generating random solution. Throws ArgumentException if any lower bound
exceeds its corresponding upper bound.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings October 23, 2025 18:41

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Comment thread src/Optimizers/OptimizerBase.cs Outdated
ooples and others added 2 commits October 23, 2025 14:43
- Added bounds validation loop before random solution generation
- Fixed documentation to add periods to param tags
- Ensures lowerBounds[i] <= upperBounds[i] for all dimensions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings October 23, 2025 18:53

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Comment thread src/Optimizers/OptimizerBase.cs
@ooples ooples requested a review from Copilot October 23, 2025 20:15

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



Comment thread src/Optimizers/OptimizerBase.cs Outdated
ooples and others added 2 commits October 23, 2025 16:41
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Franklin Moormann <cheatcountry@gmail.com>
Copilot AI review requested due to automatic review settings October 23, 2025 22:04

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.



}
}

var solution = new Vector<T>(lowerBounds.Length);

Copilot AI Oct 23, 2025


Vector initialization may fail without proper element initialization. Consider using Vector.Build.Dense or equivalent method to ensure proper allocation and initialization of the vector structure.

Suggested change
var solution = new Vector<T>(lowerBounds.Length);
var solution = Vector<T>.Build.Dense(lowerBounds.Length);

@ooples ooples merged commit 116b18c into merge-dev2-to-master Oct 23, 2025
0 of 2 checks passed
@ooples ooples deleted the fix/us-bf-025-initialize-random-solution branch October 23, 2025 22:05
ooples added a commit that referenced this pull request Apr 18, 2026
…mute

Replaces the scalar nested-loop implementations of Patchify, Unpatchify,
ReshapeForHeads, ReshapeFromHeads, and the ExtractModulation/ApplyAdaLN/
AddWithGate helpers with their Engine-op equivalents — reshape + permute +
reshape pipelines and zero-copy TensorSliceAxis views off the AdaLN
modulation tensor.

Specific changes:

  * Patchify/Unpatchify: replace the 6-deep scalar nested loop with
    Engine.Reshape → Engine.TensorPermute → Engine.Reshape. The permute
    runs through the engine's vectorized memcpy kernel (or stays as a
    view when the downstream consumer supports strided) instead of a
    per-element C# scalar copy.

  * ReshapeForHeads/FromHeads: same pattern (reshape + permute + reshape)
    instead of the original triple-nested scalar copy with span slices.

  * ExtractModulation eliminated entirely. Previously ForwardBlock did 6
    ExtractModulation calls per block (24 blocks × 50 inference steps ×
    6 = 7200 T[] allocations per Predict). Now ForwardBlock reshapes the
    AdaLN modulation output to [B, 6, 1, H] once and slices out each
    shift/scale/gate via Engine.TensorSliceAxis — zero allocations, zero
    scalar fill loops.

  * ApplyAdaLN / AddWithGate rewritten to accept Tensor<T> broadcast
    views (from TensorSliceAxis) instead of T[] scalar arrays. The
    previous implementations built a [1,1,H] broadcast tensor via
    TensorAllocator.Rent + a per-element scalar fill; the new ones use
    Engine.TensorAddScalar / Engine.TensorBroadcastMultiply / Engine.
    TensorBroadcastAdd directly on the sliced views.

  * EmbedPatches / FinalLayerWithAdaLN: replaced the
    TensorAllocator.Rent + CopyTo scratch-buffer round trips with
    Engine.Reshape view chains (the downstream dense forward is
    contiguous-input-tolerant).

Every hot-path scalar copy in DiT forward is now either a view
(zero-copy) or a SIMD-vectorized engine op. Depends on the matching
AiDotNet.Tensors PR #196 for the double-precision SIMD fallbacks in
TensorMatMul / ScaledDotProductAttention / FusedLinear / broadcast ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ooples added a commit that referenced this pull request Apr 18, 2026
Pulls in the Tensors SIMD fallback fixes from Tensors PR #196:
  - TensorMatMul double fallback routed through MultiplyBlocked
  - ScaledDotProductAttention double SIMD fast path
  - FusedGemmBiasActivation double fallback SIMD-routed
  - TensorBroadcast{Multiply,Add} trailing-repeat fast path
  - Odometer-based Contiguous() materialization
  - LayerNorm generic fallback uses SIMD numOps.Sum

Unblocks the DiT vectorization work in this PR — every
double-precision matmul / broadcast / attention op it relies on now
hits a SIMD path instead of a scalar triple-loop.

Also unblocks MobileNetV3_Train_CompletesWithoutError which hit the
TensorCopy source.Length regression (Tensors PR #195, included in
0.46.1 via #194's follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ooples added a commit that referenced this pull request Apr 19, 2026
…#1156)

* fix(stats): break BasicStats.CalculateStats recursion that crashed test host

BasicStats's lazy-stats accessors all read through property getters that
call EnsureFullStatsComputed -> CalculateStats. When CalculateStats
itself reads any of those properties (N, Mean, Variance,
StandardDeviation, Median, FirstQuartile, ThirdQuartile), the getter
re-enters EnsureFullStatsComputed because _fullStatsComputed is still
false during the body of CalculateStats — that flag is only set after
CalculateStats returns. The result is unbounded recursion that crashes
the xUnit test host with a StackOverflowException.

Stack from CI failures:
  BasicStats<double>.CalculateStats(Vector<double>)
  BasicStats<double>.EnsureFullStatsComputed()
  BasicStats<double>.get_N()                       // <-- re-entry
  BasicStats<double>.CalculateStats(Vector<double>)
  ...

Reported as the "Test Run Aborted — host process exited unexpectedly"
on these CI jobs (PR #1154 / master):
  - AiDotNet.Serving.Tests
  - ModelFamily - Classification
  - ModelFamily - Clustering/GP
  - ModelFamily - Regression
  - ModelFamily - TimeSeries/Activation/Loss
  - Unit - 04 Feature/Fit/Fitness/Genetics

Fix: compute every intermediate value into a local variable, only
assign to the publicly-observable properties at the end. Property reads
never happen inside CalculateStats, so the lazy getter never re-enters.

Verified locally: FederatedRun_Lifecycle_FedAvg_AggregatesAndAdvancesRound
(which serializes a model and triggers the lazy stats path) now passes
end-to-end instead of crashing the host.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
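The fix pattern described above — compute every intermediate into a local, assign to the observable state only at the end — can be reduced to a minimal hypothetical example (the real `BasicStats` computes many more statistics; names here are illustrative):

```csharp
using System;

// Hypothetical reduction of the lazy-stats recursion bug and its fix.
// The buggy variant read the Mean PROPERTY inside Ensure(); since
// _computed is still false at that point, the getter re-enters Ensure()
// unbounded. The fixed variant below only reads locals inside Ensure().
class LazyStats
{
    private readonly double[] _data;
    private bool _computed;
    private double _mean, _variance;

    public LazyStats(double[] data) => _data = data;

    public double Mean { get { Ensure(); return _mean; } }
    public double Variance { get { Ensure(); return _variance; } }

    private void Ensure()
    {
        if (_computed) return;

        double sum = 0;
        foreach (var x in _data) sum += x;
        double mean = sum / _data.Length;     // local, never the property

        double ss = 0;
        foreach (var x in _data) ss += (x - mean) * (x - mean);
        double variance = ss / _data.Length;  // population variance

        // Assign observable state only once every dependency is a local.
        _mean = mean;
        _variance = variance;
        _computed = true;                     // flip only at the very end
    }
}
```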

* test(data): cross-platform retry trigger for RobustFileOps tests

Two RobustFileOps retry tests passed on Windows but failed on the Linux
CI runner because FileShare.None on a FileStream does not actually
block File.Move on POSIX:

  - Move_SucceedsAfter_TransientSharingViolation
  - Move_Propagates_WhenLockNeverReleases

Both used a held FileStream with FileShare.None as the
"failed-attempt" trigger. On Linux that does not block rename(2), so
File.Move succeeded on the first attempt — Move_Propagates' Assert.
Throws fired ("No exception was thrown") and Move_SucceedsAfter
short-circuited without ever exercising the retry loop.

Replaced the lock-based simulation with a cross-platform missing-
parent-directory trigger:

  - Move_SucceedsAfter_TransientSharingViolation: destination's parent
    directory does not exist when MoveWithRetryAsync runs. File.Move
    throws DirectoryNotFoundException (an IOException subclass) on
    each attempt. A background task creates the parent ~250 ms in,
    so a subsequent attempt succeeds. Retry path is exercised on
    every platform.
  - Move_Propagates_WhenLockNeverReleases: parent directory is never
    created. Every attempt throws DirectoryNotFoundException; the
    final attempt must propagate. Test now asserts the more specific
    DirectoryNotFoundException type for clarity, and adds a check
    that the source file is still in place after the failed move
    (the move never started, so src must remain).

Verified locally: all 5 RobustFileOpsMoveRetryTests pass on net10.0.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
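The retry loop these tests exercise can be sketched as follows — a hypothetical stand-in, not the actual `MoveWithRetryAsync` implementation: each attempt that throws an `IOException` (which includes `DirectoryNotFoundException`) is retried after a delay, and the final attempt's exception propagates:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical retry helper: retries an action on IOException up to
// maxAttempts times; the last failure is allowed to propagate unchanged.
static class RetryMove
{
    public static async Task RunWithRetryAsync(Action attempt, int maxAttempts, int delayMs)
    {
        for (int i = 1; ; i++)
        {
            try { attempt(); return; }
            catch (IOException) when (i < maxAttempts)
            {
                await Task.Delay(delayMs); // transient failure: back off and retry
            }
        }
    }
}
```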

* fix(serialization): match MultiHeadAttentionLayer 5-arg constructor in deserializer

DeserializationHelper.CreateMultiHeadAttentionLayer was looking up a
4-parameter constructor signature

  (int, int, int, IActivationFunction<T>)

but MultiHeadAttentionLayer<T>'s constructor is actually 5-parameter:

  (int, int, int, IActivationFunction<T>?, IInitializationStrategy<T>?)

Type.GetConstructor matches by exact parameter list, not by "first N
plus defaults," so the lookup returned null and threw

  "Cannot find MultiHeadAttentionLayer constructor with
   (int, int, int, IActivationFunction<T>)"

Failure path observed in CI:
  - InferenceOptimizer.OptimizeForInference(model, cloneModel: true)
    -> NeuralNetworkBase.Clone (serialization round-trip)
      -> DeserializationHelper.CreateMultiHeadAttentionLayer (throws)
    -> caught in OptimizeForInference, returns (model, false)
  - Test InferenceOptimizer_RewritesMultiHeadAttention_To
    CachedAttention_ForTextGeneration_WhenKVCacheEnabled then sees
    anyApplied == false instead of the expected rewrite.

The fix mirrors how CreateDenseLayer already passes
IInitializationStrategy<T> in its constructor lookup. Pass null for
the strategy slot, matching the constructor's default-value semantics.

Verified locally: all 9 InferenceOptimizerTests pass on net10.0.

Wider impact: this also unblocks Clone-via-serialization for any model
containing MHA layers — previously every transformer-style model would
silently skip inference optimizations after clone failed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
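The exact-parameter-list behavior of `Type.GetConstructor` described above is easy to demonstrate in isolation (the `Layer` type below is a hypothetical stand-in for `MultiHeadAttentionLayer<T>`, with `string?` standing in for the activation and initialization-strategy interfaces):

```csharp
using System;
using System.Reflection;

// Hypothetical 5-parameter constructor with two defaulted trailing args,
// mirroring the MultiHeadAttentionLayer<T> shape described above.
class Layer
{
    public Layer(int a, int b, int c, string? activation = null, string? init = null) { }
}

static class CtorLookup
{
    // Type.GetConstructor matches the EXACT declared parameter list;
    // default values are a compile-time feature and do not participate.
    public static ConstructorInfo? Find(params Type[] args) =>
        typeof(Layer).GetConstructor(args);
}
```

A 4-type lookup returns null despite the default values; only the full 5-type signature matches — which is why the deserializer must pass the strategy slot explicitly (as null) rather than omit it.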

* fix(optimizer): re-allocate Adam moments when cached shape mismatches param

AdamOptimizer.Step keyed its per-parameter moment tensors (_tapeM,
_tapeV) by Tensor reference. If a parameter was first seen while a
lazy-initialized layer (e.g. MultiHeadAttentionLayer with
IsLazy: true initialization strategy) had its weights allocated as
the placeholder [0, 0] tensor, the cached m / v captured shape
[0, 0] and Length 0. Once the layer materialized real weights and
real-shape gradients arrived, mScaled and gradScaled differed in
shape; TensorAdd broadcast to the larger shape and the result no
longer matched m's underlying buffer.

Fix: at every Step, validate the cached m and v match the parameter's
current shape via SequenceEqual, and re-allocate if not. Identity
caching by reference still works for stable parameters; the explicit
shape check covers the lazy-init case.

Note: this fix alone is not sufficient to make
MobileNetV3_Train_CompletesWithoutError pass — that test also hits a
separate bug in AiDotNet.Tensors (CpuEngine.TensorCopy uses
sourceArray.Length instead of source.Length, see follow-up PR on the
Tensors repo). This commit fixes the lazy-init half of the issue,
which would otherwise mask the Tensors bug behind a noisier symptom.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
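The shape-validated cache described in the fix can be sketched like this — a hypothetical reduction (the real code caches both m and v tensors keyed by parameter reference; here a single flat buffer models one moment):

```csharp
using System;
using System.Linq;

// Hypothetical moment cache: re-allocates whenever the cached shape no
// longer matches the parameter's current shape, covering the lazy-init
// case where a [0,0] placeholder is later replaced by real weights.
class MomentCache
{
    private int[]? _cachedShape;
    private double[]? _m;

    public double[] GetMoments(int[] paramShape)
    {
        // SequenceEqual catches a shape change even when the parameter
        // reference itself is stable.
        if (_m is null || _cachedShape is null || !_cachedShape.SequenceEqual(paramShape))
        {
            _cachedShape = (int[])paramShape.Clone();
            _m = new double[paramShape.Aggregate(1, (a, b) => a * b)];
        }
        return _m;
    }
}
```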

* fix(serving): cross-platform sanitizer for AesGcm artifact filenames

Path.GetInvalidFileNameChars returns a platform-specific set:
  - Windows: includes ':', '\', '*', '?', '<', '>', '|', '"' plus
    control chars 1-31
  - Linux / macOS: only '\0' and '/'

Encrypted model artifacts are designed to be portable across operating
systems (an artifact written on a Linux training cluster might be
loaded on a Windows inference host). Using the platform-specific set
broke the AesGcmModelArtifactProtectorTests.
ProtectToFile_WritesHeaderAndReturnsArtifact test on Linux CI:
  expected "my_model.aidn.enc"
  actual   "my:model.aidn.enc"   (':' isn't invalid on POSIX)

Fix: replace Path.GetInvalidFileNameChars with a hardcoded
cross-platform-invalid set that combines the Windows superset with
POSIX. Now the sanitizer produces identical output on every OS, so
artifacts are guaranteed mountable everywhere.

Verified locally: ProtectToFile_WritesHeaderAndReturnsArtifact passes
on net10.0.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
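A minimal sketch of the cross-platform sanitizer idea — hypothetical, not the shipped implementation: a hardcoded invalid-character set combining the Windows superset with POSIX, so the output is byte-identical on every OS:

```csharp
using System;
using System.Linq;

// Hypothetical cross-platform filename sanitizer. The set below unions the
// Windows-invalid characters ('"', '<', '>', '|', ':', '*', '?', '\\', '/',
// control chars 0-31) with the POSIX-invalid ones ('\0', '/'), replacing
// each hit with '_'.
static class FileNameSanitizer
{
    private static readonly char[] Invalid =
        new[] { '"', '<', '>', '|', ':', '*', '?', '\\', '/', '\0' }
        .Concat(Enumerable.Range(1, 31).Select(i => (char)i)) // control chars
        .ToArray();

    public static string Sanitize(string name) =>
        new string(name.Select(c => Invalid.Contains(c) ? '_' : c).ToArray());
}
```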

* fix(layers): sparselinearlayer reports supportstraining true

The layer's SupportsTraining property previously returned false with a
detailed comment explaining that sparse weight tensors don't fit the
tape's dense ParameterBuffer<T> contract. But returning false was
incorrect: SupportsTraining gates the LEGACY non-tape training path
(`if (layer.SupportsTraining) layer.UpdateParameters(lr)`), and the
layer DOES have a working UpdateParameters that updates both the
sparse weight tensor and the dense bias vector from gradients
computed in Backward. Setting it to false was preventing the layer
from training in the legacy path even though the update mechanism
existed.

Tape-mode discovery is unaffected by SupportsTraining — that path
uses [TrainableParameter] / RegisterTrainableParameter discovery, not
this property. The sparse weight tensor remains invisible to tape
mode pending sparse-aware ParameterBuffer<T> support, which is a
separate architectural follow-up.

Updated docstring to describe the actual semantics (legacy path
trains the layer; tape-mode caveat documented inline).

Verified locally: SparseLinearLayer_SupportsTraining_IsTrue passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(dit): vectorize Patchify/Unpatchify/AdaLN via Engine reshape+permute

Replaces the scalar nested-loop implementations of Patchify, Unpatchify,
ReshapeForHeads, ReshapeFromHeads, and the ExtractModulation/ApplyAdaLN/
AddWithGate helpers with their Engine-op equivalents — reshape + permute +
reshape pipelines and zero-copy TensorSliceAxis views off the AdaLN
modulation tensor.

Specific changes:

  * Patchify/Unpatchify: replace the 6-deep scalar nested loop with
    Engine.Reshape → Engine.TensorPermute → Engine.Reshape. The permute
    runs through the engine's vectorized memcpy kernel (or stays as a
    view when the downstream consumer supports strided) instead of a
    per-element C# scalar copy.

  * ReshapeForHeads/FromHeads: same pattern (reshape + permute + reshape)
    instead of the original triple-nested scalar copy with span slices.

  * ExtractModulation eliminated entirely. Previously ForwardBlock did 6
    ExtractModulation calls per block (24 blocks × 50 inference steps ×
    6 = 7200 T[] allocations per Predict). Now ForwardBlock reshapes the
    AdaLN modulation output to [B, 6, 1, H] once and slices out each
    shift/scale/gate via Engine.TensorSliceAxis — zero allocations, zero
    scalar fill loops.

  * ApplyAdaLN / AddWithGate rewritten to accept Tensor<T> broadcast
    views (from TensorSliceAxis) instead of T[] scalar arrays. The
    previous implementations built a [1,1,H] broadcast tensor via
    TensorAllocator.Rent + a per-element scalar fill; the new ones use
    Engine.TensorAddScalar / Engine.TensorBroadcastMultiply / Engine.
    TensorBroadcastAdd directly on the sliced views.

  * EmbedPatches / FinalLayerWithAdaLN: replaced the
    TensorAllocator.Rent + CopyTo scratch-buffer round trips with
    Engine.Reshape view chains (the downstream dense forward is
    contiguous-input-tolerant).

Every hot-path scalar copy in DiT forward is now either a view
(zero-copy) or a SIMD-vectorized engine op. Depends on the matching
AiDotNet.Tensors PR #196 for the double-precision SIMD fallbacks in
TensorMatMul / ScaledDotProductAttention / FusedLinear / broadcast ops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(init): batched parallel Xavier normal weight initialization

Replaces the per-element SampleGaussian call loop (which ran a
virtual-dispatch Box-Muller + rejection test for every element) with a
tight specialized fill routine for double and float: one paired
Box-Muller transform produces two samples per pair of uniform draws,
halving the log/sqrt/sin/cos call count, and large layers (≥ 256K
elements) are partitioned across the thread pool so the ~29s of init
cost per DiT-XL-sized Dense layer (hidden 8192 × out 12288 = 100M
doubles per AdaLN modulation layer) is parallelized instead of running
single-threaded.

Context: even after the Tensors-side SIMD fixes on the forward matmul
path, the first Pika21 Predict paid ~150s of lazy-init overhead across
the 24 block layers because each first-call XavierNormalInitialize hit
a scalar loop doing 100M virtual calls. The cost is one-time per layer
but it dominated the first forward and pushed Training_Should* tests
that exercise a fresh model over the per-test xUnit budget.

Preserves reproducibility: per-chunk RNGs are seeded deterministically
from the master Random instance, so for a given parent seed the output
is stable across thread counts. Keeps the generic-T fallback on the
old path since only float/double are expected to be perf-critical.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
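The paired Box-Muller fill described above can be sketched as follows — a hypothetical single-threaded version (the shipped routine additionally partitions large buffers across the thread pool with deterministically seeded per-chunk RNGs):

```csharp
using System;

// Hypothetical paired Box-Muller fill: each pair of uniform draws yields
// TWO standard-normal samples, halving the log/sqrt/sin/cos call count
// versus one sample per transform.
static class GaussianFill
{
    public static void FillStandardNormal(double[] buffer, Random rng)
    {
        int i = 0;
        while (i + 1 < buffer.Length)
        {
            // u1 in (0, 1] so Math.Log stays finite.
            double u1 = 1.0 - rng.NextDouble();
            double u2 = rng.NextDouble();
            double r = Math.Sqrt(-2.0 * Math.Log(u1));
            double theta = 2.0 * Math.PI * u2;
            buffer[i++] = r * Math.Cos(theta); // first sample of the pair
            buffer[i++] = r * Math.Sin(theta); // second sample, same draws
        }
        if (i < buffer.Length) // odd-length tail: one extra single transform
        {
            double u1 = 1.0 - rng.NextDouble();
            buffer[i] = Math.Sqrt(-2.0 * Math.Log(u1)) *
                        Math.Cos(2.0 * Math.PI * rng.NextDouble());
        }
    }
}
```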

* chore(deps): bump aidotnet.tensors 0.46.0 -> 0.46.1

Pulls in the Tensors SIMD fallback fixes from Tensors PR #196:
  - TensorMatMul double fallback routed through MultiplyBlocked
  - ScaledDotProductAttention double SIMD fast path
  - FusedGemmBiasActivation double fallback SIMD-routed
  - TensorBroadcast{Multiply,Add} trailing-repeat fast path
  - Odometer-based Contiguous() materialization
  - LayerNorm generic fallback uses SIMD numOps.Sum

Unblocks the DiT vectorization work in this PR — every
double-precision matmul / broadcast / attention op it relies on now
hits a SIMD path instead of a scalar triple-loop.

Also unblocks MobileNetV3_Train_CompletesWithoutError which hit the
TensorCopy source.Length regression (Tensors PR #195, included in
0.46.1 via #194's follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(stats): break EnsureFullStatsComputed recursion in errorstats/modelstats/predictionstats

Same bug class as the earlier BasicStats fix: the Calculate* method was
assigning to properties AND reading them back during its own body, but
the property getters call EnsureFullStatsComputed — which is still
running the Calculate* method. The _fullStatsComputed flag only flips
after Calculate* returns, so any intra-method property read re-enters
Calculate* unbounded. The test host crashes with StackOverflowException
before the test framework can report anything except "host process
exited unexpectedly."

Specific re-entry points the previous code had:

  * ErrorStats.CalculateErrorStats
    - RMSE = _numOps.Sqrt(MSE)              ← re-enters via MSE getter
    - AIC/BIC/AICAlt pass RSS                ← re-enters via RSS getter

  * ModelStats.CalculateModelStats
    - VIFList = ... CalculateVIF(CorrelationMatrix, ...) ← CorrelationMatrix
    - Mahalanobis block reads CovarianceMatrix thrice  ← CovarianceMatrix

  * PredictionStats.CalculatePredictionStats
    - AdjustedR2 = ... CalculateAdjustedR2(R2, ...)         ← R2
    - PredictionIntervalCoverage = ... (PredictionInterval.Lower,
      PredictionInterval.Upper)                             ← PredictionInterval
    - ConfidenceInterval/CredibleInterval read BestDistributionFit
      .DistributionType                                     ← BestDistributionFit

All three methods are rewritten to compute every intermediate into a
local variable first; properties are only assigned once every dependency
is a local. No property reads happen inside Calculate*, so the lazy
getter never re-enters.

Observed failure path (Classification CI shard, PR #1156 run):
  AdaBoostClassifierTests.Predict_ShouldBeDeterministic trains the
  model, which computes ErrorStats, which stack-overflows the host.
  Other crashed tests in the same shard:
    - ExtraTreesClassifierTests.Clone_ShouldProduceIdenticalPredictions
    - CategoricalNaiveBayesTests.Builder_AccuracyShouldBeatChance
    - OneVsRestClassifierTests.Builder_AccuracyShouldBeatChance
  All 4 pass locally after this fix.

Unblocks the host_crash jobs on PR #1154 triage:
  - ModelFamily - Classification
  - ModelFamily - Clustering/GP
  - ModelFamily - Regression
  - ModelFamily - TimeSeries/Activation/Loss
  - Unit - 04 Feature/Fit/Fitness/Genetics
  - AiDotNet.Serving.Tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(networks): resnet/vgg train adds batch dim for 3d input

ResNet/VGG's Forward() explicitly accepts 3D [C,H,W] input and expands
it to 4D [1,C,H,W] before running the layer stack. Their Train()
overrides, however, called TrainWithTape directly — which delegates to
NeuralNetworkBase.ForwardForTraining, which does NOT add a batch dim
and just runs the raw tensor through every layer.

For a 3D input [3, 32, 32], the conv/pool chain preserves the rank-3
shape and the classifier's AdaptiveAveragePool + Flatten ends up
producing [512, 1] (the 512 final-block channel count gets treated as
a batch dim by FlattenLayer.Forward's "preserve first dim" rule). The
final DenseLayer with inputSize=512 sees actualInputSize=1 via
input.Shape[^1], calls EnsureWeightShapeForInput(1) which resizes
weights to [1, 10], and produces [512, 10] — which then fails the
loss shape check in EnsureTargetMatchesPredicted because the target
is [10].

Fix: mirror Forward()'s expansion in Train() — when input is 3D, add
a leading batch dim to BOTH input and target before dispatching to
TrainWithTape. Any 4D input is passed through untouched. The target
expansion is guarded so a caller that already provided a batched
target is not double-expanded.

Verified locally, all 4 of the previously-failing tests now pass:
  - ResNetNetwork_Train_CompletesWithoutError
  - ResNetNetwork_Train_LossDecreases
  - VGGNetwork_Train_CompletesWithoutError
  - VGGNetwork_Train_LossDecreases

Closes the 08a NN-Classic (ResNet/VGG/DenseNet) CI shard failure from
the PR #1154 triage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
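The shape-expansion rule the fix applies can be reduced to a tiny hypothetical helper operating on shape arrays (the real fix expands the tensors themselves, for both input and target, before dispatching to TrainWithTape):

```csharp
using System;

// Hypothetical shape helper mirroring the Train() fix: a rank-3 [C,H,W]
// shape gets a leading batch dim to become [1,C,H,W]; a rank-4 shape
// passes through untouched.
static class BatchDimFix
{
    public static int[] EnsureBatched(int[] shape)
    {
        if (shape.Length == 3)
        {
            var batched = new int[4];
            batched[0] = 1;                       // leading batch dim
            Array.Copy(shape, 0, batched, 1, 3);  // copy C, H, W
            return batched;
        }
        return shape;                             // already [N,C,H,W]
    }
}
```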

* fix(networks): mobilenetv2 handles 3d input in forward/train/namedactivations

Same structural bug as ResNet/VGG: MobileNetV2's Forward / Train /
GetNamedLayerActivations all iterated the layer stack with the raw
input. For 3D [C, H, W] inputs, BatchNormalizationLayer's channel
scale (shape [1, C, 1, 1]) cannot broadcast against the 3D layout
because dim 1 of the input (spatial H) doesn't match the BN's C
channel count:
  "Tensors with shapes [16, 32, 32] and [1, 16, 1] cannot be broadcast:
   dimension 1 has sizes 32 and 16 (must be equal or one must be 1)."

Fix: add a leading batch dimension when the caller passes a 3D input
so every BN in every InvertedResidualBlock sees the 4D layout it
requires, and squeeze it back off at the end of Forward so the output
shape matches the caller's 3D contract. Train() expands both input
and target the same way so ForwardForTraining (which iterates layers
without adding batch dim) also sees the correct shape.
GetNamedLayerActivations is overridden with the same expansion so the
layer-by-layer probe used by NamedLayerActivations_ShouldBeNonEmpty
doesn't hit the same BN broadcast error.

Also fixes the test: the parameterless MobileNetV2Network constructor
defaults to 1000 ImageNet classes and 224x224 input; the test probed
with 3x64x64 and 10-class OutputShape. Swap in the architecture-aware
overload so the classifier head matches the expected output dim.

Goes from 0/17 passing on the previous config to 14/17 passing — the
three remaining failures are a deeper shape-collapse issue inside the
InvertedResidualBlock chain for the NamedLayerActivations probe and a
perf timeout on the training tests, both of which are separate from
this broadcast-shape root cause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(networks): instructorembedding test shape matches 768-dim model

InstructorEmbedding's default ctor builds a 768-dim transformer
(inputSize=768, outputSize=768) but the test inherited the base
class's default InputShape=[1, 4] and OutputShape=[1, 1]. The training
tests fed a [1, 4] input to a 768-dim model and a [1, 1] target that
the loss function then tried to subtract from the model's [1, 768]
prediction, throwing "Tensor shapes must match. Got [1, 768] and
[1, 1]." in MeanSquaredErrorLoss.ComputeTapeLoss.

Fix: override InputShape/OutputShape to the model's actual 768-dim
embedding layout so input, prediction, and target all align.

Closes the InstructorEmbedding part of the "ModelFamily - NeuralNetworks"
CI shard failure from the PR #1154 triage (remaining failures in that
shard are MobileNetV2 and are addressed in the previous commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(networks): convolutionalneuralnetwork train adds batch dim for 3d input

Same 3D-input bug as ResNet/VGG/MobileNetV2: CNN's Train() called
TrainWithTape with the raw 3D [C, H, W] tensor. ForwardForTraining
iterates layers without a shape-adjustment step, so the final
FlattenLayer treats the 32-channel dimension as a batch
(preserve-first-dim rule) and produces a [32, 10] prediction against
a [10] one-hot target — fails EnsureTargetMatchesPredicted with
"Target shape dimension 0 (10) does not match predicted shape
dimension 0 (32)."

Fix: expand 3D input to 4D before dispatching to TrainWithTape, and
expand the target too when the caller provided it without a batch dim.

All 5 previously-failing CNN tests pass locally:
  - TrainingError_ShouldNotExceedTestError
  - Training_ShouldReduceLoss
  - Training_ShouldChangeParameters
  - GradientFlow_ShouldBeNonZeroAndFinite
  - ForwardPass_ShouldBeFinite_AfterTraining

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(networks): unet3d decoder channel count + test output shape

Two related problems surfaced by every UNet3D test:

1. LayerHelper.CreateDefaultUNet3DLayers — the decoder path declared
   the first Conv3D of each non-bottleneck-adjacent block with
   `inChannels = encoderFilters[block + 1] * 2`. The "*2" was there to
   account for a full U-Net concatenating skip connections from the
   encoder at each decoder level. This implementation does NOT
   actually perform the concatenation, so the preceding decoder
   block's second Conv3D emitted encoderFilters[block + 1] channels,
   not double that. Every CI call (and every local Predict) hit
   "Input channels (128) must match kernel in_channels (256)" in the
   first decoder block after the one adjacent to the bottleneck.

   Fix: drop the "*2" so the declared in_channels match the tensors
   that actually flow through. Concatenating real skip connections is
   a separate architectural improvement.

2. UNet3DTests — OutputShape declared as [1], treating the network as
   a classifier, but UNet3D is a per-voxel segmentation model whose
   final 1x1x1 Conv3D emits [numClasses, D, H, W] per sample. With
   default numClasses=1 and 32³ voxel grid, every training test tried
   to subtract a [1, 32, 32, 32] prediction from a [1] target and
   threw "Tensor shapes must match. Got [1, 32, 32, 32] and [1]."

   Fix: OutputShape → [1, 32, 32, 32] so input, prediction, and
   target all line up.

Goes from 0/17 passing on UNet3D to 12/17. The five remaining
failures are separate issues (NaN during training for this conv stack,
metadata parity) that are independent of these two root causes.

Closes 7 of the 8 UNet3D failures from the PR #1154 CI triage that
were all attributed to the "Input channels (128) vs (256)" error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gp): escalating cholesky jitter for sparsegaussianprocess.fit

Ky = Kuu + D·Kuf·Kuf^T is only positive-semi-definite in exact
arithmetic, so floating-point roundoff on the combined matrix
routinely pushes the smallest eigenvalue just below zero and
CholeskyDecomposition throws "Matrix is not positive definite" on
every SparseGaussianProcess fit. Kuu already gets a constant 1e-4
jitter before its Cholesky, but the Ky path had none — that produced
the six SparseGaussianProcessTests failures in the PR #1156 CI shard.

Add a PyTorch/GPyTorch-style escalating jitter schedule (1e-6 →
1e-4 → 1e-2 → 1e-1, scaled by the matrix trace so it's invariant to
kernel amplitude) and retry the Cholesky after each increment.
Geometric escalation instead of a single larger constant keeps the
numerical error introduced for already-well-conditioned matrices
minimal while still rescuing the borderline cases.

Goes from 7/16 passing to 14/16 on SparseGaussianProcessTests.
Remaining two failures are separate bugs (predictive mean is NaN,
not a PD-matrix issue) tracked independently.
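The retry schedule reads roughly like this (NumPy sketch; the function name and the exact retry loop are illustrative, not the C# implementation):

```python
import numpy as np

def cholesky_with_jitter(K, levels=(1e-6, 1e-4, 1e-2, 1e-1)):
    """Sketch of the escalating-jitter schedule: try a plain Cholesky,
    then retry with trace-scaled jitter 1e-6 -> 1e-4 -> 1e-2 -> 1e-1
    added to the diagonal until the factorization succeeds."""
    scale = np.trace(K) / K.shape[0]  # trace scaling: invariant to kernel amplitude
    for jitter in (0.0,) + tuple(levels):
        try:
            return np.linalg.cholesky(K + jitter * scale * np.eye(K.shape[0]))
        except np.linalg.LinAlgError:
            continue
    raise np.linalg.LinAlgError("not positive definite at any jitter level")

# A rank-deficient PSD matrix: the borderline case the schedule rescues.
A = np.array([[1.0, 1.0], [1.0, 1.0]])
L = cholesky_with_jitter(A)
```

Geometric escalation keeps the perturbation tiny for matrices that only needed a nudge, instead of paying a large constant everywhere.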

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(generators): correct audio/video modeldomain ordinal in testscaffoldgenerator

ModelDomain enum order is General=0, Vision=1, Language=2, Audio=3,
Video=4, Multimodal=5. The scaffold generator had Audio and Video
ordinals swapped in three places:

  1. Line 1495 — treats Domain=3 as "temporal video" and emits
     `throw new NotImplementedException(...)` in the test's
     CreateNetwork. Audio is 3, not 4, so EVERY audio model
     (PlayHT, Bark, StableAudio, etc.) got a NotImplementedException
     factory instead of a working architecture. Ten PlayHTTests
     failures on PR #1156 traced back to this single line.

  2. Line 1520 — `isAudio = Domains.Contains(4)`. Should be 3.

  3. Line 1633 — `isVideoModel = Domains.Contains(3)`. Should be 4.

All three sites now use the correct ordinals (Audio=3, Video=4).

This aligns the generator with the enum and the facade/customization
pattern the project prefers over hard-coded factories — every audio
model's test can now construct a real Architecture and run the test
body (which exposes the real model-specific failures downstream,
where they can be fixed in the model code rather than hidden behind
a runtime factory stub).

PlayHTTests go from 0/21 passing (all NotImplementedException) to
2/21 (metadata/parameter-count tests now execute). The remaining 19
failures are a separate PlayHT LayerNorm shape-mismatch issue that
can be addressed independently now that the tests actually run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(neuralnetworks): align word2vec test shapes with softmax vocab head

Word2Vec's default constructor uses vocabSize=10000. The final layer
emits a 10000-dim softmax over the vocabulary, so per-sample output is
[1, 10000], not the [1, 1] implied by the base-class default. Align
input/output shapes so OutputDimension_ShouldMatchExpectedShape
compares the right tensors.

* test(ner): emit 768-dim scaffolded shapes for transformer ner models

TransformerNerBase, SpanBasedNerBase, and the LSTM-CRF family all
validate token embeddings against their Options.HiddenDimension (768 by
default, 100 for LSTM-CRF). The auto-scaffolded test base inherited
[1, 4] as InputShape, so MultiHeadAttention threw "input embedding
dimension (4) does not match weight dimension (768)" before any
downstream logic could run — the reported SciBertNer training-error
regression on PR #1156.

Emit InputShape = [8, 768] for TransformerNer/SpanBasedNer and [8, 100]
for SequenceLabelingNer in the test scaffolder. Add a manual
TinyBertNerTests with [8, 312] so the one model that overrides
HiddenDimension still gets covered.

* fix(layers): default rnn head should use identityactivation, not relu-via-null

The recurrent network's default layer stack terminated in a dense layer
constructed with activationFunction: null, which the dense ctor
substitutes with ReLU. The preceding two tanh recurrent layers produce
small mixed-sign activations (range ~[-0.16, 0.16] on random input),
and ReLU then clips the single-output regression head to exactly 0 for
essentially any input. That is why ScaledInput_ShouldChangeOutput and
DifferentInputs_ShouldProduceDifferentOutputs saw identical zero
outputs for distinct inputs on RecurrentNeuralNetworkTests.

Pass an explicit IdentityActivation so the dense head stays linear. The
task-appropriate softmax/sigmoid activation layer emitted after it
remains unchanged.

* fix(memorynetwork): seed memory and wire training through the memory-aware flow

Two root causes made every MemoryNetwork prediction identical
regardless of input, and the training path diverge from the prediction
path:

1. _memory was initialized as a zero matrix. MemoryReadLayer computes
   keys · memoryᵀ, so with zero memory every attention score is zero,
   softmax produces a uniform distribution, and attentionWeights ·
   memory reads back zero — every subsequent layer saw the same
   constant vector. ScaledInput_ShouldChangeOutput and DifferentInputs_
   ShouldProduceDifferentOutputs both reported the network ignored its
   input. Seed _memory with small Xavier-scale random values so there
   is something non-trivial to attend over on the very first forward
   pass.

2. Predict special-cased MemoryReadLayer/MemoryWriteLayer to pass the
   memory tensor and reshaped rank-1 input to [1, n], but Train went
   through the base TrainWithTape → ForwardForTraining path, which did
   neither, so training crashed ("TensorMatMul requires tensors of
   rank >= 2") or silently read from an identity-memory fallback.
   Factor the shared layer walk into RunLayers() and override
   ForwardForTraining so Train and Predict share the same memory
   plumbing.

Locally MemoryNetworkTests goes from 9 failing → 2 (the remaining two
are the known MemoryReadLayer deserialization gap and
NamedLayerActivations, tracked separately).
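The zero-memory failure mode in point 1 is easy to reproduce numerically (NumPy sketch of the attention read; names are illustrative, not the C# layer API):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memory_read(key, memory):
    """Attention read: scores = key · memoryᵀ, softmax over slots,
    then a weighted sum of the memory rows."""
    weights = softmax(key @ memory.T)
    return weights @ memory

# Zero memory: every score is 0, softmax is uniform, and the read is the
# zero vector -- identical for ANY input key.
zero_mem = np.zeros((4, 8))
r1 = memory_read(np.ones(8), zero_mem)
r2 = memory_read(np.arange(8.0), zero_mem)

# Small Xavier-scale random memory: reads now depend on the key.
rng = np.random.default_rng(0)
mem = rng.normal(scale=np.sqrt(1.0 / 8), size=(4, 8))
s1 = memory_read(np.ones(8), mem)
s2 = memory_read(np.arange(8.0), mem)
```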

* fix(quantumnn): migrate training to trainwithtape and use identity on final dense

QuantumNeuralNetworkTests was failing 10/17 because Train called
_trainOptimizer.UpdateParameters(Layers) without first running a
backward pass, tripping "Backward pass must be called before updating
parameters" inside each dense layer's legacy per-learning-rate update
path. Switch Train to TrainWithTape, matching ResNet/VGG/MobileNetV2.

The quantum default layer stack also terminated its final dense in the
generator with activationFunction: null (→ ReLU), so regression-task
output got clipped at zero before the task-specific final activation
layer could run. Promote that dense to IdentityActivation so the
subsequent ActivationLayer owns the non-linearity, same fix pattern as
the RNN regression head.

Locally QNN goes from 10 failing → 5 (remaining five look like a
deeper input-independent forward pass — separate issue).

* fix(diffusion): upscaleavideo inputconv should match latent channels, not concat width

UpscaleAVideoModel set input_channels=8 to describe the "concat latent
+ low-res conditioning" path from the reference paper, but
ForwardVideoUNet adds the image condition via the _imageCondProjection
dense layer *after* _inputConv, not by concatenating before it. The
first conv was therefore sized for 8 channels while only ever seeing 4,
and the 14 UpscaleAVideoModelTests cases on the diffusion A-I shard all
failed with "Expected input depth 8, but got 4".

Pin input_channels to latent_channels so the conv weight shape matches
what the forward pass feeds it. This exposes a downstream FiLM
projection width mismatch tracked separately
(VideoUNetPredictor.ApplyFilmConditioning) — fixing that is the next
step.

* fix(diffusion): videounet spatial resblock must mix channels, not width

CreateSpatialResBlock wrapped a LazyDense(inChannels, outChannels), but
DenseLayer projects the *last* dimension of its input. For a 4D feature
map [B, C, H, W] that is the width axis, not the channel axis — so the
resblock silently scrambled width into outChannels while leaving the
channel count untouched. The next timeCondProjection was sized for the
planned outChannels, so ApplyFilmConditioning saw "expected 2*C, got
2*outC" and threw "FiLM conditioning projection width mismatch:
expected 640, got 1280" across UpscaleAVideo and StreamingT2V tests.

Switch to a 1x1 LazyConv2D — the standard channel-mixing primitive. It
consumes [B, inChannels, H, W] and produces [B, outChannels, H, W]
without touching spatial dims, so downstream FiLM projections receive a
feature map with the channel count they were sized for.
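The difference between the two primitives is just which axis carries the projection (NumPy sketch; `dense_last_axis` and `conv1x1` are illustrative stand-ins for the dense layer and the 1x1 conv, not the repo's API):

```python
import numpy as np

def dense_last_axis(x, weight):
    """Dense-style projection of the LAST axis of [B, C, H, W]:
    for a 4D feature map this mixes WIDTH, not channels."""
    return x @ weight                    # weight: [W_in, W_out]

def conv1x1(x, weight):
    """1x1-conv-style projection: mixes CHANNELS, leaves H and W alone."""
    return np.einsum('bchw,oc->bohw', x, weight)  # weight: [C_out, C_in]

x = np.zeros((1, 4, 8, 8))               # [B, C, H, W]
d = dense_last_axis(x, np.zeros((8, 16)))  # width becomes 16, channels untouched
c = conv1x1(x, np.zeros((16, 4)))          # channels become 16, spatial dims untouched
```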

Follow-ups (separate): multi-head attention, temporal attention, and
cross-attention layers still receive the 4D tensor directly without
reshape, which surfaces as input-dim mismatches further down the
forward pass.

* fix(serialization): register memoryread and memorywrite layers for deserialization

Clone()-style roundtrips on MemoryNetwork crashed with "Layer type
MemoryReadLayer is not supported for deserialization (no known
constructor found)" because DeserializationHelper.CreateLayerFromType
had no explicit arm for either MemoryReadLayer or MemoryWriteLayer, and
the default fallback tries a ctor(int[]) that neither layer exposes.

Add cases for both. MemoryReadLayer uses an
(inputDim, memoryDim, outputDim, IActivation) ctor and MemoryWriteLayer
uses (inputDim, memoryDim, IActivation). Pick memoryDim from a
"MemoryDimension" metadata key when present, otherwise reuse the output
dim — which matches how MemoryNetwork wires its MemoryReadLayer
(EmbeddingSize for all three dims).

* fix(gp): sparsegp ky solve falls back to svd pseudoinverse when cholesky gives up

SparseGaussianProcess.Fit builds Ky = Kuu + D·Kuf·Kufᵀ and factors it
via Cholesky. In exact arithmetic Ky is PSD (not PD) whenever
rank(D·Kuf·Kufᵀ) < m — the common regime where inducing points equal
the data dimensionality — and floating-point roundoff then pushes the
smallest eigenvalue just below zero, so CholeskyDecomposition throws
"Matrix is not positive definite". The earlier escalating jitter
schedule (1e-6 → 1e-4 → 1e-2 → 1e-1 of the trace) was still losing on
the CI shard, leaving 7 SparseGaussianProcessTests failing.

Keep the Cholesky + jitter escalation as the primary path for
performance, then fall back to an SVD Moore-Penrose pseudoinverse when
no jitter level makes Ky PD. The pseudoinverse truncates singular
values below max(rows, cols) · ε_machine · σ_max, which is
numpy.linalg.pinv's default tolerance, and produces a well-defined α
even when D·Kuf·Kufᵀ has a near-null space.

Locally SparseGaussianProcessTests: 7 failing → 16/16 passing.
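A minimal sketch of the two-tier solve (NumPy; `solve_psd` is an illustrative name, and numpy.linalg.pinv applies the max(m, n) · ε · σ_max truncation mentioned above by default):

```python
import numpy as np

def solve_psd(K, b):
    """Cholesky as the primary (fast) path when K is PD; otherwise fall
    back to a truncated-SVD Moore-Penrose pseudoinverse, which stays
    well-defined even when K has a near-null space."""
    try:
        np.linalg.cholesky(K)        # PD check / primary path
        return np.linalg.solve(K, b)
    except np.linalg.LinAlgError:
        return np.linalg.pinv(K) @ b # SVD pseudoinverse fallback

# Rank-1 PSD matrix: Cholesky gives up, pinv still returns a finite alpha.
K = np.outer([1.0, 2.0], [1.0, 2.0])
alpha = solve_psd(K, np.array([1.0, 2.0]))
```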

* fix(regression): poisson irls must not overwrite coefficients with nan/inf

Predictions_ShouldBeFinite and CollinearFeatures_ShouldNotCrash both
failed on net10 because the IRLS step in PoissonRegression.Train can
produce a newCoefficients vector with NaN entries when Xᵀ·W·X is
numerically singular (the solve with QR/SVD doesn't always refuse the
factorization — it sometimes just hands back 1/0 or 0/0). The loop then
assigned those NaN values into coefficients and intercept, and every
subsequent PredictMean call propagated NaN through the linear
predictor.

Check for non-finite entries before accepting the step and halt
iteration instead, preserving the last known-good coefficients. Matches
statsmodels GLM's "LinAlgError" abort.

Locally PoissonRegressionTests: 20/22 → 21/22 (the remaining
MoreData_ShouldNotDegrade_R2 is a separate convergence issue).
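The guard is essentially this (NumPy sketch; names illustrative):

```python
import numpy as np

def accept_step(coef, new_coef):
    """Accept the IRLS step only when every entry is finite; otherwise
    keep the last known-good coefficients and signal the loop to halt."""
    if np.all(np.isfinite(new_coef)):
        return new_coef, True
    return coef, False               # preserve last known-good, stop iterating

good = np.array([0.5, -1.0])
bad = np.array([np.nan, 3.0])        # e.g. from a numerically singular solve
coef, keep_going = accept_step(good, bad)
```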

* fix(regression): rbf solve via tikhonov-damped svd instead of normal-equations inverse

RBF design matrices are often severely ill-conditioned — when a
handful of centers end up far from every input, the corresponding
columns go to near-zero and XᵀX has a huge condition number. The
previous solve inverted XᵀX + λI directly via Matrix.Inverse(), which
amplified roundoff into NaN predictions (Predictions_ShouldBeFinite,
SingleFeature_ShouldWork, CollinearFeatures_ShouldNotCrash) and
catastrophic negative R² (R2_ShouldBePositive_OnLinearData saw
R² ≈ -10¹²).

Replace with a Tikhonov-regularized SVD solve on X directly:
  weights = V · diag(σ / (σ² + λ²)) · Uᵀ · y
with λ = 1e-6 · σ_max. This smoothly damps the ill-conditioned
directions instead of zeroing them (which a hard-tolerance
pseudoinverse would, dropping real signal along with roundoff) and
avoids forming the normal-equations matrix that was the source of the
explosion.

Locally RbfRegression: NaN predictions cleared, R² on linear data
improved by 11+ orders of magnitude (from ~-10¹² to single-digit
negative). A couple of R²-positivity tests still fail — likely
center-placement / gamma choice, separate improvement — but the
NaN-poisoning is gone.
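The damped solve, term for term (NumPy sketch under the same λ = 1e-6 · σ_max choice; function name illustrative):

```python
import numpy as np

def tikhonov_svd_solve(X, y, rel_lambda=1e-6):
    """weights = V · diag(σ / (σ² + λ²)) · Uᵀ · y with λ = rel_lambda · σ_max,
    applied to X directly so the normal-equations matrix XᵀX is never formed."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    lam = rel_lambda * s[0]
    d = s / (s ** 2 + lam ** 2)      # smooth damping, no hard truncation
    return Vt.T @ (d * (U.T @ y))

# Well-conditioned sanity check: recovers the exact least-squares solution.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
w_true = np.array([1.0, -1.0])
w = tikhonov_svd_solve(X, X @ w_true)
```

For well-separated singular values σ ≫ λ the damped factor σ/(σ² + λ²) ≈ 1/σ, so the solve matches ordinary least squares; for σ → 0 it smoothly tends to 0 instead of blowing up.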

* fix: address 10 CodeRabbit review comments on PR #1156

- AesGcmModelArtifactProtector.SanitizeFileName: reject Windows DOS
  reserved device names (CON/PRN/AUX/NUL/COM1-9/LPT1-9) and trim trailing
  dot/space characters. Previously portable-artifact guarantee failed on
  names like "CON.bin" or "model." — now prefixed with '_' and trimmed so
  artifacts created on POSIX hosts still mount on Windows.
- DiTNoisePredictor.ForwardBlock + FinalLayerWithAdaLN: guard against
  misconfigured AdaLN modulation output sizes. If modulation.Length isn't
  divisible by 6 * _hiddenSize (or 2 * _hiddenSize for final layer),
  throw InvalidOperationException with a clear diagnostic rather than
  letting integer division truncate silently and Engine.Reshape throw a
  cryptic shape-mismatch error downstream.
- RobustFileOpsMoveRetryTests: renamed
  Move_SucceedsAfter_TransientSharingViolation → ...TransientMissingParentDirectory
  and Move_Propagates_WhenLockNeverReleases → ...WhenParentDirectoryNeverCreated
  so the test names match the actual cross-platform retry trigger (missing
  destination parent directory, not lock/share violation which doesn't
  work on Linux). Fixed XML-doc reference from IOException → DirectoryNotFoundException.
- PredictionStats.CalculatePredictionStats: reuse R2 + AdjustedR2 already
  computed eagerly in the constructor with identical inputs, instead of
  recalculating them in the lazy-compute path. Cuts two O(n) scans.
- NeuralNetworkBase: new protected PromoteToBatchedTensor + EnsureBatchForCnnTraining
  helpers. Extracted from the duplicated 4-line rank-3 → rank-4 input
  expansion pattern that ResNet/VGG/MobileNetV2/ConvolutionalNeuralNetwork
  all carried individually. Subclasses' Train() now delegates to the base
  helper and removes their private AddBatchDimension copies.
  (Name differs from per-subclass AddBatchDimension to avoid CS0108
  hides-inherited warnings on 10+ segmentation subclasses that keep their
  own local helpers for non-CNN-training paths.)

Verify:
- src build net10.0 — 0 errors
- tests build net10.0 — 0 errors
- Tensors 0.46.1 confirmed published on NuGet

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: franklinic <franklin@ivorycloud.com>
ooples added a commit that referenced this pull request Apr 19, 2026
…#1156)

* fix(stats): break BasicStats.CalculateStats recursion that crashed test host

BasicStats's lazy-stats accessors all read through property getters that
call EnsureFullStatsComputed -> CalculateStats. When CalculateStats
itself reads any of those properties (N, Mean, Variance,
StandardDeviation, Median, FirstQuartile, ThirdQuartile), the getter
re-enters EnsureFullStatsComputed because _fullStatsComputed is still
false during the body of CalculateStats — that flag is only set after
CalculateStats returns. The result is unbounded recursion that crashes
the xUnit test host with a StackOverflowException.

Stack from CI failures:
  BasicStats<double>.CalculateStats(Vector<double>)
  BasicStats<double>.EnsureFullStatsComputed()
  BasicStats<double>.get_N()                       // <-- re-entry
  BasicStats<double>.CalculateStats(Vector<double>)
  ...

Reported as the "Test Run Aborted — host process exited unexpectedly"
on these CI jobs (PR #1154 / master):
  - AiDotNet.Serving.Tests
  - ModelFamily - Classification
  - ModelFamily - Clustering/GP
  - ModelFamily - Regression
  - ModelFamily - TimeSeries/Activation/Loss
  - Unit - 04 Feature/Fit/Fitness/Genetics

Fix: compute every intermediate value into a local variable, only
assign to the publicly-observable properties at the end. Property reads
never happen inside CalculateStats, so the lazy getter never re-enters.

Verified locally: FederatedRun_Lifecycle_FedAvg_AggregatesAndAdvancesRound
(which serializes a model and triggers the lazy stats path) now passes
end-to-end instead of crashing the host.
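The locals-first pattern, reduced to a toy (Python sketch; the real class is the C# BasicStats, and these member names are illustrative):

```python
class LazyStats:
    """Compute every intermediate into locals and assign the publicly
    observable fields only at the end, so no property read can re-enter
    _calculate while the computed flag is still False."""
    def __init__(self, data):
        self._data = data
        self._computed = False
        self._mean = self._variance = None

    def _ensure(self):
        if not self._computed:
            self._calculate()
            self._computed = True    # only flipped AFTER _calculate returns

    def _calculate(self):
        n = len(self._data)          # locals, not self.mean / self.variance:
        mean = sum(self._data) / n   # reading those properties here would
        var = sum((x - mean) ** 2 for x in self._data) / n  # re-enter _ensure
        self._mean, self._variance = mean, var

    @property
    def mean(self):
        self._ensure()
        return self._mean

    @property
    def variance(self):
        self._ensure()
        return self._variance

s = LazyStats([1.0, 2.0, 3.0])
```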

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* test(data): cross-platform retry trigger for RobustFileOps tests

Two RobustFileOps retry tests passed on Windows but failed on the Linux
CI runner because FileShare.None on a FileStream does not actually
block File.Move on POSIX:

  - Move_SucceedsAfter_TransientSharingViolation
  - Move_Propagates_WhenLockNeverReleases

Both used a held FileStream with FileShare.None as the
"failed-attempt" trigger. On Linux that does not block rename(2), so
File.Move succeeded on the first attempt — Move_Propagates'
Assert.Throws fired ("No exception was thrown") and Move_SucceedsAfter
short-circuited without ever exercising the retry loop.

Replaced the lock-based simulation with a cross-platform missing-
parent-directory trigger:

  - Move_SucceedsAfter_TransientSharingViolation: destination's parent
    directory does not exist when MoveWithRetryAsync runs. File.Move
    throws DirectoryNotFoundException (an IOException subclass) on
    each attempt. A background task creates the parent ~250 ms in,
    so a subsequent attempt succeeds. Retry path is exercised on
    every platform.
  - Move_Propagates_WhenLockNeverReleases: parent directory is never
    created. Every attempt throws DirectoryNotFoundException; the
    final attempt must propagate. Test now asserts the more specific
    DirectoryNotFoundException type for clarity, and adds a check
    that the source file is still in place after the failed move
    (the move never started, so src must remain).

Verified locally: all 5 RobustFileOpsMoveRetryTests pass on net10.0.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(serialization): match MultiHeadAttentionLayer 5-arg constructor in deserializer

DeserializationHelper.CreateMultiHeadAttentionLayer was looking up a
4-parameter constructor signature

  (int, int, int, IActivationFunction<T>)

but MultiHeadAttentionLayer<T>'s constructor is actually 5-parameter:

  (int, int, int, IActivationFunction<T>?, IInitializationStrategy<T>?)

Type.GetConstructor matches by exact parameter list, not by "first N
plus defaults," so the lookup returned null and threw

  "Cannot find MultiHeadAttentionLayer constructor with
   (int, int, int, IActivationFunction<T>)"

Failure path observed in CI:
  - InferenceOptimizer.OptimizeForInference(model, cloneModel: true)
    -> NeuralNetworkBase.Clone (serialization round-trip)
      -> DeserializationHelper.CreateMultiHeadAttentionLayer (throws)
    -> caught in OptimizeForInference, returns (model, false)
  - Test InferenceOptimizer_RewritesMultiHeadAttention_ToCachedAttention_ForTextGeneration_WhenKVCacheEnabled
    then sees anyApplied == false instead of the expected rewrite.

The fix mirrors how CreateDenseLayer already passes
IInitializationStrategy<T> in its constructor lookup. Pass null for
the strategy slot, matching the constructor's default-value semantics.

Verified locally: all 9 InferenceOptimizerTests pass on net10.0.

Wider impact: this also unblocks Clone-via-serialization for any model
containing MHA layers — previously every transformer-style model would
silently skip inference optimizations after clone failed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(optimizer): re-allocate Adam moments when cached shape mismatches param

AdamOptimizer.Step keyed its per-parameter moment tensors (_tapeM,
_tapeV) by Tensor reference. If a parameter was first seen while a
lazy-initialized layer (e.g. MultiHeadAttentionLayer with
IsLazy: true initialization strategy) had its weights allocated as
the placeholder [0, 0] tensor, the cached m / v captured shape
[0, 0] and Length 0. Once the layer materialized real weights and
real-shape gradients arrived, mScaled and gradScaled differed in
shape; TensorAdd broadcast to the larger shape and the result no
longer matched m's underlying buffer.

Fix: at every Step, validate the cached m and v match the parameter's
current shape via SequenceEqual, and re-allocate if not. Identity
caching by reference still works for stable parameters; the explicit
shape check covers the lazy-init case.

Note: this fix alone is not sufficient to make
MobileNetV3_Train_CompletesWithoutError pass — that test also hits a
separate bug in AiDotNet.Tensors (CpuEngine.TensorCopy uses
sourceArray.Length instead of source.Length, see follow-up PR on the
Tensors repo). This commit fixes the lazy-init half of the issue,
which would otherwise mask the Tensors bug behind a noisier symptom.
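The shape-revalidation idea in miniature (NumPy sketch: an identity-keyed cache plus an explicit shape check; names are illustrative, and ndarray.resize stands in for the lazy layer materializing real weights in place):

```python
import numpy as np

class MomentCache:
    """Moments keyed by parameter identity, but re-allocated whenever the
    cached shape no longer matches the parameter -- covering the
    lazy-init placeholder -> real-weights transition."""
    def __init__(self):
        self._m = {}

    def moment_for(self, param):
        key = id(param)
        m = self._m.get(key)
        if m is None or m.shape != param.shape:  # explicit shape check
            m = np.zeros_like(param)             # re-allocate on mismatch
            self._m[key] = m
        return m

cache = MomentCache()
p = np.zeros((0, 0))                 # lazy-init placeholder weights
m0 = cache.moment_for(p)             # cached with shape [0, 0]
p.resize((3, 4), refcheck=False)     # layer materializes real weights in place
m1 = cache.moment_for(p)             # stale cache detected, re-allocated
```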

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(serving): cross-platform sanitizer for AesGcm artifact filenames

Path.GetInvalidFileNameChars returns a platform-specific set:
  - Windows: includes ':', '\', '*', '?', '<', '>', '|', '"' plus
    control chars 1-31
  - Linux / macOS: only '\0' and '/'

Encrypted model artifacts are designed to be portable across operating
systems (an artifact written on a Linux training cluster might be
loaded on a Windows inference host). Using the platform-specific set
broke the
AesGcmModelArtifactProtectorTests.ProtectToFile_WritesHeaderAndReturnsArtifact
test on Linux CI:
  expected "my_model.aidn.enc"
  actual   "my:model.aidn.enc"   (':' isn't invalid on POSIX)

Fix: replace Path.GetInvalidFileNameChars with a hardcoded
cross-platform-invalid set that combines the Windows superset with
POSIX. Now the sanitizer produces identical output on every OS, so
artifacts are guaranteed mountable everywhere.
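A sketch of the idea (Python for illustration; the exact invalid-character set and the replacement character here are assumptions, not the shipped C# implementation):

```python
# Hardcoded cross-platform-invalid set: the Windows superset
# ('<>:"/\\|?*' plus control chars 0-31) unioned with the POSIX set
# ('\0' and '/'), so sanitization behaves identically on every OS.
INVALID = set('<>:"/\\|?*') | {chr(c) for c in range(32)}

def sanitize_filename(name):
    """Replace every cross-platform-invalid character with '_'."""
    return ''.join('_' if ch in INVALID else ch for ch in name)

out = sanitize_filename("my:model.aidn.enc")
```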

Verified locally: ProtectToFile_WritesHeaderAndReturnsArtifact passes
on net10.0.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* fix(layers): sparselinearlayer reports supportstraining true

The layer's SupportsTraining property previously returned false with a
detailed comment explaining that sparse weight tensors don't fit the
tape's dense ParameterBuffer<T> contract. But returning false was
incorrect: SupportsTraining gates the LEGACY non-tape training path
(`if (layer.SupportsTraining) layer.UpdateParameters(lr)`), and the
layer DOES have a working UpdateParameters that updates both the
sparse weight tensor and the dense bias vector from gradients
computed in Backward. Setting it to false was preventing the layer
from training in the legacy path even though the update mechanism
existed.

Tape-mode discovery is unaffected by SupportsTraining — that path
uses [TrainableParameter] / RegisterTrainableParameter discovery, not
this property. The sparse weight tensor remains invisible to tape
mode pending sparse-aware ParameterBuffer<T> support, which is a
separate architectural follow-up.

Updated docstring to describe the actual semantics (legacy path
trains the layer; tape-mode caveat documented inline).

Verified locally: SparseLinearLayer_SupportsTraining_IsTrue passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(dit): vectorize Patchify/Unpatchify/AdaLN via Engine reshape+permute

Replaces the scalar nested-loop implementations of Patchify, Unpatchify,
ReshapeForHeads, ReshapeFromHeads, and the ExtractModulation/ApplyAdaLN/
AddWithGate helpers with their Engine-op equivalents — reshape + permute +
reshape pipelines and zero-copy TensorSliceAxis views off the AdaLN
modulation tensor.

Specific changes:

  * Patchify/Unpatchify: replace the 6-deep scalar nested loop with
    Engine.Reshape → Engine.TensorPermute → Engine.Reshape. The permute
    runs through the engine's vectorized memcpy kernel (or stays as a
    view when the downstream consumer supports strided) instead of a
    per-element C# scalar copy.

  * ReshapeForHeads/FromHeads: same pattern (reshape + permute + reshape)
    instead of the original triple-nested scalar copy with span slices.

  * ExtractModulation eliminated entirely. Previously ForwardBlock did 6
    ExtractModulation calls per block (24 blocks × 50 inference steps ×
    6 = 7200 T[] allocations per Predict). Now ForwardBlock reshapes the
    AdaLN modulation output to [B, 6, 1, H] once and slices out each
    shift/scale/gate via Engine.TensorSliceAxis — zero allocations, zero
    scalar fill loops.

  * ApplyAdaLN / AddWithGate rewritten to accept Tensor<T> broadcast
    views (from TensorSliceAxis) instead of T[] scalar arrays. The
    previous implementations built a [1,1,H] broadcast tensor via
    TensorAllocator.Rent + a per-element scalar fill; the new ones use
    Engine.TensorAddScalar / Engine.TensorBroadcastMultiply / Engine.
    TensorBroadcastAdd directly on the sliced views.

  * EmbedPatches / FinalLayerWithAdaLN: replaced the
    TensorAllocator.Rent + CopyTo scratch-buffer round trips with
    Engine.Reshape view chains (the downstream dense forward is
    contiguous-input-tolerant).

Every hot-path scalar copy in DiT forward is now either a view
(zero-copy) or a SIMD-vectorized engine op. Depends on the matching
AiDotNet.Tensors PR #196 for the double-precision SIMD fallbacks in
TensorMatMul / ScaledDotProductAttention / FusedLinear / broadcast ops.
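The Patchify pipeline above is the standard reshape → permute → reshape trick (NumPy sketch; axis order assumed [B, C, H, W] as in the commit, function name illustrative):

```python
import numpy as np

def patchify(x, p):
    """[B, C, H, W] -> [B, (H/p)*(W/p), C*p*p] via reshape -> permute ->
    reshape, replacing a 6-deep per-element scalar copy loop with bulk ops."""
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // p, p, W // p, p)  # split H and W into (blocks, p)
    x = x.transpose(0, 2, 4, 1, 3, 5)          # [B, H/p, W/p, C, p, p]
    return x.reshape(B, (H // p) * (W // p), C * p * p)

x = np.arange(2 * 3 * 4 * 4, dtype=float).reshape(2, 3, 4, 4)
patches = patchify(x, 2)
```

Unpatchify is the same pipeline run in reverse (reshape, inverse permute, reshape).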

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(init): batched parallel Xavier normal weight initialization

Replaces the per-element SampleGaussian call loop (which ran a
virtual-dispatch Box-Muller + rejection test for every element) with a
tight specialized fill routine for double and float: one paired
Box-Muller transform produces two samples per pair of uniform draws,
halving the log/sqrt/sin/cos call count, and large layers (≥ 256K
elements) are partitioned across the thread pool so the ~29s of init
cost per DiT-XL-sized Dense layer (hidden 8192 × out 12288 = 100M
doubles per AdaLN modulation layer) is parallelized instead of running
single-threaded.

Context: even after the Tensors-side SIMD fixes on the forward matmul
path, the first Pika21 Predict paid ~150s of lazy-init overhead across
the 24 block layers because each first-call XavierNormalInitialize hit
a scalar loop doing 100M virtual calls. The cost is one-time per layer
but it dominated the first forward and pushed Training_Should* tests
that exercise a fresh model over the per-test xUnit budget.

Preserves reproducibility: per-chunk RNGs are seeded deterministically
from the master Random instance, so for a given parent seed the output
is stable across thread counts. Keeps the generic-T fallback on the
old path since only float/double are expected to be perf-critical.
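The paired draw looks like this (Python sketch; the chunking/threading described above is omitted, and the function names are illustrative):

```python
import math
import random

def box_muller_pair(rng):
    """One pair of uniform draws yields TWO standard-normal samples,
    halving the log/sqrt/sin/cos call count versus one sample per
    transform."""
    u1 = max(rng.random(), 1e-300)   # avoid log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

def xavier_normal_fill(n, fan_in, fan_out, seed=0):
    """Fill n weights with N(0, 2/(fan_in+fan_out)) via the paired draw.
    A deterministically seeded RNG keeps the stream reproducible."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        z0, z1 = box_muller_pair(rng)
        out.extend((std * z0, std * z1))
    return out[:n]

w = xavier_normal_fill(10_000, fan_in=64, fan_out=64, seed=42)
```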

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump aidotnet.tensors 0.46.0 -> 0.46.1

Pulls in the Tensors SIMD fallback fixes from Tensors PR #196:
  - TensorMatMul double fallback routed through MultiplyBlocked
  - ScaledDotProductAttention double SIMD fast path
  - FusedGemmBiasActivation double fallback SIMD-routed
  - TensorBroadcast{Multiply,Add} trailing-repeat fast path
  - Odometer-based Contiguous() materialization
  - LayerNorm generic fallback uses SIMD numOps.Sum

Unblocks the DiT vectorization work in this PR — every
double-precision matmul / broadcast / attention op it relies on now
hits a SIMD path instead of a scalar triple-loop.

Also unblocks MobileNetV3_Train_CompletesWithoutError which hit the
TensorCopy source.Length regression (Tensors PR #195, included in
0.46.1 via #194's follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(stats): break EnsureFullStatsComputed recursion in errorstats/modelstats/predictionstats

Same bug class as the earlier BasicStats fix: the Calculate* method was
assigning to properties AND reading them back during its own body, but
the property getters call EnsureFullStatsComputed — which is still
running the Calculate* method. The _fullStatsComputed flag only flips
after Calculate* returns, so any intra-method property read re-enters
Calculate* unbounded. The test host crashes with StackOverflowException
before the test framework can report anything except "host process
exited unexpectedly."

Specific re-entry points the previous code had:

  * ErrorStats.CalculateErrorStats
    - RMSE = _numOps.Sqrt(MSE)              ← re-enters via MSE getter
    - AIC/BIC/AICAlt pass RSS                ← re-enters via RSS getter

  * ModelStats.CalculateModelStats
    - VIFList = ... CalculateVIF(CorrelationMatrix, ...) ← CorrelationMatrix
    - Mahalanobis block reads CovarianceMatrix thrice  ← CovarianceMatrix

  * PredictionStats.CalculatePredictionStats
    - AdjustedR2 = ... CalculateAdjustedR2(R2, ...)         ← R2
    - PredictionIntervalCoverage = ... (PredictionInterval.Lower,
      PredictionInterval.Upper)                             ← PredictionInterval
    - ConfidenceInterval/CredibleInterval read BestDistributionFit
      .DistributionType                                     ← BestDistributionFit

All three methods are rewritten to compute every intermediate into a
local variable first; properties are only assigned once every dependency
is a local. No property reads happen inside Calculate*, so the lazy
getter never re-enters.
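The recursion pattern and the compute-into-locals fix can be sketched in Python (names hypothetical; the project code is C# and uses more statistics than this):

```python
class Stats:
    """Lazy stats holder mirroring the EnsureFullStatsComputed pattern."""

    def __init__(self, values):
        self._values = values
        self._computed = False
        self._mse = None
        self._rmse = None

    def _ensure_computed(self):
        if not self._computed:
            self._calculate()       # flag only flips AFTER _calculate returns
            self._computed = True

    @property
    def mse(self):
        self._ensure_computed()
        return self._mse

    @property
    def rmse(self):
        self._ensure_computed()
        return self._rmse

    def _calculate(self):
        # BUG pattern: `self._rmse = self.mse ** 0.5` would re-enter
        # _ensure_computed -> _calculate unbounded, since the flag is unset.
        # FIX: compute every intermediate into a local, assign fields once.
        mse = sum(v * v for v in self._values) / len(self._values)
        rmse = mse ** 0.5
        self._mse = mse
        self._rmse = rmse
```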

Observed failure path (Classification CI shard, PR #1156 run):
  AdaBoostClassifierTests.Predict_ShouldBeDeterministic trains the
  model, which computes ErrorStats, which stack-overflows the host.
  Other crashed tests in the same shard:
    - ExtraTreesClassifierTests.Clone_ShouldProduceIdenticalPredictions
    - CategoricalNaiveBayesTests.Builder_AccuracyShouldBeatChance
    - OneVsRestClassifierTests.Builder_AccuracyShouldBeatChance
  All 4 pass locally after this fix.

Unblocks the host_crash jobs on PR #1154 triage:
  - ModelFamily - Classification
  - ModelFamily - Clustering/GP
  - ModelFamily - Regression
  - ModelFamily - TimeSeries/Activation/Loss
  - Unit - 04 Feature/Fit/Fitness/Genetics
  - AiDotNet.Serving.Tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(networks): resnet/vgg train adds batch dim for 3d input

ResNet/VGG's Forward() explicitly accepts 3D [C,H,W] input and expands
it to 4D [1,C,H,W] before running the layer stack. Their Train()
overrides, however, called TrainWithTape directly — which delegates to
NeuralNetworkBase.ForwardForTraining, which does NOT add a batch dim
and just runs the raw tensor through every layer.

For a 3D input [3, 32, 32], the conv/pool chain preserves the rank-3
shape and the classifier's AdaptiveAveragePool + Flatten ends up
producing [512, 1] (the 512 final-block channel count gets treated as
a batch dim by FlattenLayer.Forward's "preserve first dim" rule). The
final DenseLayer with inputSize=512 sees actualInputSize=1 via
input.Shape[^1], calls EnsureWeightShapeForInput(1) which resizes
weights to [1, 10], and produces [512, 10] — which then fails the
loss shape check in EnsureTargetMatchesPredicted because the target
is [10].

Fix: mirror Forward()'s expansion in Train() — when input is 3D, add
a leading batch dim to BOTH input and target before dispatching to
TrainWithTape. Any 4D input is passed through untouched. The target
expansion is guarded so a caller that already provided a batched
target is not double-expanded.
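A shape-level sketch of the expansion and the guarded target (Python; helper name is illustrative, not the project's API):

```python
def ensure_batched(input_shape, target_shape):
    """Mirror Forward()'s expansion in Train(): a rank-3 [C, H, W] input gets
    a leading batch dim; rank-4 input passes through untouched. The target is
    only expanded when the caller did not already provide a batched one."""
    if len(input_shape) == 3:
        input_shape = [1] + list(input_shape)
        if len(target_shape) == 1:  # bare one-hot vector, e.g. [10]
            target_shape = [1] + list(target_shape)
    return input_shape, target_shape
```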

Verified locally, all 4 of the previously-failing tests now pass:
  - ResNetNetwork_Train_CompletesWithoutError
  - ResNetNetwork_Train_LossDecreases
  - VGGNetwork_Train_CompletesWithoutError
  - VGGNetwork_Train_LossDecreases

Closes the 08a NN-Classic (ResNet/VGG/DenseNet) CI shard failure from
the PR #1154 triage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(networks): mobilenetv2 handles 3d input in forward/train/namedactivations

Same structural bug as ResNet/VGG: MobileNetV2's Forward / Train /
GetNamedLayerActivations all iterated the layer stack with the raw
input. For 3D [C, H, W] inputs, BatchNormalizationLayer's channel
scale (shape [1, C, 1, 1]) cannot broadcast against the 3D layout
because dim 1 of the input (spatial H) doesn't match the BN's C
channel count:
  "Tensors with shapes [16, 32, 32] and [1, 16, 1] cannot be broadcast:
   dimension 1 has sizes 32 and 16 (must be equal or one must be 1)."

Fix: add a leading batch dimension when the caller passes a 3D input
so every BN in every InvertedResidualBlock sees the 4D layout it
requires, and squeeze it back off at the end of Forward so the output
shape matches the caller's 3D contract. Train() expands both input
and target the same way so ForwardForTraining (which iterates layers
without adding batch dim) also sees the correct shape.
GetNamedLayerActivations is overridden with the same expansion so the
layer-by-layer probe used by NamedLayerActivations_ShouldBeNonEmpty
doesn't hit the same BN broadcast error.

Also fixes the test: the parameterless MobileNetV2Network constructor
defaults to 1000 ImageNet classes and 224x224 input; the test probed
with 3x64x64 and 10-class OutputShape. Swap in the architecture-aware
overload so the classifier head matches the expected output dim.

Goes from 0/17 passing on the previous config to 14/17 passing — the
three remaining failures are a deeper shape-collapse issue inside the
InvertedResidualBlock chain for the NamedLayerActivations probe and a
perf timeout on the training tests, both of which are separate from
this broadcast-shape root cause.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(networks): instructorembedding test shape matches 768-dim model

InstructorEmbedding's default ctor builds a 768-dim transformer
(inputSize=768, outputSize=768) but the test inherited the base
class's default InputShape=[1, 4] and OutputShape=[1, 1]. The training
tests fed a [1, 4] input to a 768-dim model and a [1, 1] target that
the loss function then tried to subtract from the model's [1, 768]
prediction, throwing "Tensor shapes must match. Got [1, 768] and
[1, 1]." in MeanSquaredErrorLoss.ComputeTapeLoss.

Fix: override InputShape/OutputShape to the model's actual 768-dim
embedding layout so input, prediction, and target all align.

Closes the InstructorEmbedding part of the "ModelFamily - NeuralNetworks"
CI shard failure from the PR #1154 triage (remaining failures in that
shard are MobileNetV2 and are addressed in the previous commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(networks): convolutionalneuralnetwork train adds batch dim for 3d input

Same 3D-input bug as ResNet/VGG/MobileNetV2: CNN's Train() called
TrainWithTape with the raw 3D [C, H, W] tensor. ForwardForTraining
iterates layers without a shape-adjustment step, so the final
FlattenLayer treats the 32-channel dimension as a batch
(preserve-first-dim rule) and produces a [32, 10] prediction against
a [10] one-hot target — fails EnsureTargetMatchesPredicted with
"Target shape dimension 0 (10) does not match predicted shape
dimension 0 (32)."

Fix: expand 3D input to 4D before dispatching to TrainWithTape, and
expand the target too when the caller provided it without a batch dim.

All 5 previously-failing CNN tests pass locally:
  - TrainingError_ShouldNotExceedTestError
  - Training_ShouldReduceLoss
  - Training_ShouldChangeParameters
  - GradientFlow_ShouldBeNonZeroAndFinite
  - ForwardPass_ShouldBeFinite_AfterTraining

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(networks): unet3d decoder channel count + test output shape

Two related problems surfaced by every UNet3D test:

1. LayerHelper.CreateDefaultUNet3DLayers — the decoder path declared
   the first Conv3D of each non-bottleneck-adjacent block with
   `inChannels = encoderFilters[block + 1] * 2`. The "*2" was there to
   account for a full U-Net concatenating skip connections from the
   encoder at each decoder level. This implementation does NOT
   actually perform the concatenation, so the preceding decoder
   block's second Conv3D emitted encoderFilters[block + 1] channels,
   not double that. Every CI call (and every local Predict) hit
   "Input channels (128) must match kernel in_channels (256)" in the
   first decoder block after the one adjacent to the bottleneck.

   Fix: drop the "*2" so the declared in_channels match the tensors
   that actually flow through. Concatenating real skip connections is
   a separate architectural improvement.

2. UNet3DTests — OutputShape declared as [1], treating the network as
   a classifier, but UNet3D is a per-voxel segmentation model whose
   final 1x1x1 Conv3D emits [numClasses, D, H, W] per sample. With
   default numClasses=1 and 32³ voxel grid, every training test tried
   to subtract a [1, 32, 32, 32] prediction from a [1] target and
   threw "Tensor shapes must match. Got [1, 32, 32, 32] and [1]."

   Fix: OutputShape → [1, 32, 32, 32] so input, prediction, and
   target all line up.

Goes from 0/17 passing on UNet3D to 12/17. The five remaining
failures are separate issues (NaN during training for this conv stack,
metadata parity) that are independent of these two root causes.

Closes 7 of the 8 UNet3D failures from the PR #1154 CI triage that
were all attributed to the "Input channels (128) vs (256)" error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gp): escalating cholesky jitter for sparsegaussianprocess.fit

Ky = Kuu + D·Kuf·Kuf^T is only positive-semi-definite in exact
arithmetic, so floating-point roundoff on the combined matrix
routinely pushes the smallest eigenvalue just below zero and
CholeskyDecomposition throws "Matrix is not positive definite" on
every SparseGaussianProcess fit. Kuu already gets a constant 1e-4
jitter before its Cholesky, but the Ky path had none — that produced
the six SparseGaussianProcessTests failures in the PR #1156 CI shard.

Add a PyTorch/GPyTorch-style escalating jitter schedule (1e-6 →
1e-4 → 1e-2 → 1e-1, scaled by the matrix trace so it's invariant to
kernel amplitude) and retry the Cholesky after each increment.
Geometric escalation instead of a single larger constant keeps the
numerical error introduced for already-well-conditioned matrices
minimal while still rescuing the borderline cases.
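The escalation loop can be sketched in plain Python (a toy dense Cholesky stands in for the library's CholeskyDecomposition; schedule and trace-scaling follow the commit text):

```python
import math

def cholesky(a):
    """Plain Cholesky; raises ValueError if the matrix is not positive definite."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def cholesky_with_jitter(a, schedule=(0.0, 1e-6, 1e-4, 1e-2, 1e-1)):
    """Retry the factorization with escalating diagonal jitter, scaled by the
    trace so the schedule is invariant to kernel amplitude."""
    n = len(a)
    trace = sum(a[i][i] for i in range(n))
    for level in schedule:
        jitter = level * trace
        try:
            return cholesky([[a[i][j] + (jitter if i == j else 0.0)
                              for j in range(n)] for i in range(n)])
        except ValueError:
            continue
    raise ValueError("no jitter level made the matrix positive definite")
```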

Goes from 7/16 passing to 14/16 on SparseGaussianProcessTests.
Remaining two failures are separate bugs (predictive mean is NaN,
not a PD-matrix issue) tracked independently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(generators): correct audio/video modeldomain ordinal in testscaffoldgenerator

ModelDomain enum order is General=0, Vision=1, Language=2, Audio=3,
Video=4, Multimodal=5. The scaffold generator had Audio and Video
ordinals swapped in three places:

  1. Line 1495 — treats Domain=3 as "temporal video" and emits
     `throw new NotImplementedException(...)` in the test's
     CreateNetwork. Audio is 3, not 4, so EVERY audio model
     (PlayHT, Bark, StableAudio, etc.) got a NotImplementedException
     factory instead of a working architecture. Ten PlayHTTests
     failures on PR #1156 traced back to this single line.

  2. Line 1520 — `isAudio = Domains.Contains(4)`. Should be 3.

  3. Line 1633 — `isVideoModel = Domains.Contains(3)`. Should be 4.

All three sites now use the correct ordinals (Audio=3, Video=4).
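A Python mirror of the ordering (the project's enum is C#; the helper names are illustrative):

```python
from enum import IntEnum

class ModelDomain(IntEnum):
    General = 0
    Vision = 1
    Language = 2
    Audio = 3
    Video = 4
    Multimodal = 5

def is_audio(domains):
    return ModelDomain.Audio in domains   # ordinal 3, not 4

def is_video(domains):
    return ModelDomain.Video in domains   # ordinal 4, not 3
```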

This aligns the generator with the enum and the facade/customization
pattern the project prefers over hard-coded factories — every audio
model's test can now construct a real Architecture and run the test
body (which exposes the real model-specific failures downstream,
where they can be fixed in the model code rather than hidden behind
a runtime factory stub).

PlayHTTests go from 0/21 passing (all NotImplementedException) to
2/21 (metadata/parameter-count tests now execute). The remaining 19
failures are a separate PlayHT LayerNorm shape-mismatch issue that
can be addressed independently now that the tests actually run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(neuralnetworks): align word2vec test shapes with softmax vocab head

Word2Vec's default constructor uses VocabSize=10000. The final layer emits
a 10000-dim softmax over the vocabulary, so per-sample output is [1, 10000],
not the [1, 1] implied by the base-class default. Align input/output shape
so OutputDimension_ShouldMatchExpectedShape compares the right tensors.

* test(ner): emit 768-dim scaffolded shapes for transformer ner models

TransformerNerBase, SpanBasedNerBase, and the LSTM-CRF family all validate
token embeddings against their options' HiddenDimension (768 by default, 100
for LSTM-CRF). The auto-scaffolded test base inherited [1, 4] as InputShape,
so MultiHeadAttention threw "input embedding dimension (4) does not match
weight dimension (768)" before any downstream logic could run — the reported
SciBertNer training-error regression on PR #1156.

Emit InputShape = [8, 768] for TransformerNer/SpanBasedNer and [8, 100] for
SequenceLabelingNer in the test scaffolder. Add a manual TinyBertNerTests
with [8, 312] so the one model that overrides HiddenDimension still gets
covered.

* fix(layers): default rnn head should use identityactivation, not relu-via-null

The recurrent network's default layer stack terminated in a dense layer
constructed with activationFunction:null, which the dense ctor substitutes
with ReLU. The preceding two tanh recurrent layers produce small mixed-sign
activations (range ~[-0.16, 0.16] on random input), and ReLU then clips the
single-output regression head to exactly 0 for essentially any input. That
is why ScaledInput_ShouldChangeOutput and
DifferentInputs_ShouldProduceDifferentOutputs saw identical zero outputs for
distinct inputs on RecurrentNeuralNetworkTests.

Pass an explicit IdentityActivation so the dense head stays linear. The
task-appropriate softmax/sigmoid activation layer emitted after it remains
unchanged.

* fix(memorynetwork): seed memory and wire training through the memory-aware flow

Two root causes made every MemoryNetwork prediction identical regardless of
input, and the training path diverge from the prediction path:

1. _memory was initialized as a zero matrix. MemoryReadLayer computes
   keys · memoryᵀ, so with zero memory every attention score is zero,
   softmax produces a uniform distribution, and attentionWeights · memory
   reads back zero — every subsequent layer saw the same constant
   vector. ScaledInput_ShouldChangeOutput and
   DifferentInputs_ShouldProduceDifferentOutputs both reported the network
   ignored its input. Seed _memory with small Xavier-scale random values so
   there is something non-trivial to attend over on the very first forward
   pass.

2. Predict special-cased MemoryRead/MemoryWriteLayer to pass the memory
   tensor and reshape rank-1 input to [1, n], but Train went through
   the base TrainWithTape → ForwardForTraining path which did neither,
   so training crashed ("TensorMatMul requires tensors of rank >= 2")
   or silently read from an identity-memory fallback. Factor the shared
   layer walk into RunLayers() and override ForwardForTraining so Train
   and Predict share the same memory plumbing.

Locally MemoryNetworkTests goes from 9 failing → 2 (the remaining two
are the known MemoryReadLayer deserialization gap and
NamedLayerActivations, tracked separately).
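The zero-memory degeneracy in point 1 can be sketched in Python. The helper name is hypothetical, and a standard Xavier uniform bound is assumed — the commit describes the scale only as "xavier-scale":

```python
import math
import random

def seed_memory(rows, cols, rng=None):
    """Xavier-scale random init for the memory matrix: small non-zero values,
    so keys · memoryᵀ gives non-uniform attention on the first forward pass
    (an all-zero memory makes softmax uniform and the read-back constant)."""
    rng = rng or random.Random(0)
    scale = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]
```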

* fix(quantumnn): migrate training to trainwithtape and use identity on final dense

QuantumNeuralNetworkTests was failing 10/17 because Train called
_trainOptimizer.UpdateParameters(Layers) without first running a backward
pass, tripping "backward pass must be called before updating parameters"
inside each dense layer's legacy per-learning-rate update path. Switch
Train to TrainWithTape, matching ResNet/VGG/MobileNetV2.

The quantum default layer stack also terminated its final dense in the
generator with activationFunction:null (→ ReLU), so regression-task
output got clipped at zero before the task-specific final activation
layer could run. Promote that dense to IdentityActivation so the
subsequent ActivationLayer owns the non-linearity, same fix pattern as
the RNN regression head.

Locally QNN goes from 10 failing → 5 (the remaining five look like a
deeper input-independent forward pass — separate issue).

* fix(diffusion): upscaleavideo inputconv should match latent channels, not concat width

UpscaleAVideoModel set input_channels=8 to describe the "concat latent +
low-res conditioning" path from the reference paper, but ForwardVideoUnet
adds the image condition via the _imageCondProjection dense layer *after*
_inputConv, not by concatenating before it. The first conv was therefore
sized for 8 channels while only ever seeing 4, and the 14
UpscaleAVideoModelTests cases on the diffusion A-I shard all failed with
"expected input depth 8, but got 4".

Pin input_channels to latent_channels so the conv weight shape matches what
the forward pass feeds it. This exposes a downstream FiLM projection width
mismatch tracked separately (VideoUnetPredictor.ApplyFilmConditioning) —
fixing that is the next step.

* fix(diffusion): videounet spatial resblock must mix channels, not width

CreateSpatialResBlock wrapped a LazyDense(inChannels, outChannels), but
DenseLayer projects the *last* dimension of its input. For a 4D feature
map [B, C, H, W] that is the width axis, not the channel axis — so the
resblock silently scrambled width into outChannels while leaving the
channel count untouched. The next timeCondProjection was sized for the
planned outChannels, so ApplyFilmConditioning saw "expected 2*C, got
2*outC" and threw "film conditioning projection width mismatch: expected
640, got 1280" across UpscaleAVideo and StreamingT2V tests.

Switch to a 1x1 LazyConv2D — the standard channel-mixing primitive. It
consumes [B, inChannels, H, W] and produces [B, outChannels, H, W]
without touching spatial dims, so downstream FiLM projections receive a
feature map with the channel count they were sized for.

Follow-ups (separate): multi-head attention, temporal attention, and
cross-attention layers still receive the 4D tensor directly without
reshape, which surfaces as input-dim mismatches further down the
forward pass.
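A shape-only sketch of why the dense layer was the wrong primitive here (Python; the helper names are illustrative, not the library's API):

```python
def dense_last_dim_shape(shape, out_features):
    # A dense layer projects the LAST dimension: [B, C, H, W] -> [B, C, H, out]
    # i.e. it rewrites the WIDTH axis of a 4D feature map, not the channels.
    return shape[:-1] + [out_features]

def conv1x1_shape(shape, out_channels):
    # A 1x1 conv mixes the CHANNEL axis: [B, C, H, W] -> [B, out, H, W],
    # leaving the spatial dims untouched.
    return [shape[0], out_channels] + shape[2:]
```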

* fix(serialization): register memoryread and memorywrite layers for deserialization

Clone()-style roundtrips on MemoryNetwork crashed with "layer type
MemoryReadLayer is not supported for deserialization (no known constructor
found)" because DeserializationHelper.CreateLayerFromType had no explicit
arm for either MemoryReadLayer or MemoryWriteLayer, and the default
fallback tries a ctor(int[]) that neither layer exposes.

Add cases for both. MemoryReadLayer uses a
(inputDim, memoryDim, outputDim, IActivation) ctor and MemoryWriteLayer
uses (inputDim, memoryDim, IActivation). Pick memoryDim from a
"MemoryDimension" metadata key when present, otherwise reuse the output
dim — which matches how MemoryNetwork wires its MemoryReadLayer
(EmbeddingSize for all three dims).

* fix(gp): sparsegp ky solve falls back to svd pseudoinverse when cholesky gives up

SparseGaussianProcess.Fit builds Ky = Kuu + D·Kuf·Kufᵀ and factors it via
Cholesky. In exact arithmetic Ky is PSD (not PD) whenever
rank(D·Kuf·Kufᵀ) < m — the common regime where inducing points equal the
data dimensionality — and floating-point roundoff then pushes the smallest
eigenvalue just below zero, so CholeskyDecomposition throws "Matrix is
not positive definite". The earlier escalating jitter schedule (1e-6 →
1e-4 → 1e-2 → 1e-1 of the trace) was still losing on the CI shard, leaving
7 SparseGaussianProcessTests failing.

Keep the Cholesky + jitter escalation as the primary path for performance,
then fall back to an SVD Moore-Penrose pseudoinverse when no jitter level
makes Ky PD. The pseudoinverse truncates singular values below
max(rows, cols) · ε_machine · σ_max, which is numpy.linalg.pinv's default
tolerance, and produces a well-defined α even when D·Kuf·Kufᵀ has a
near-null space.

Locally SparseGaussianProcessTests: 7 failing → 16/16 passing.
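The fallback solve can be sketched with NumPy (assuming NumPy is available; the function name is illustrative and the cutoff follows numpy.linalg.pinv's documented default):

```python
import numpy as np

def pinv_solve(Ky, rhs):
    """SVD Moore-Penrose fallback for a Ky that no jitter level makes PD.
    Singular values below max(rows, cols) * machine-eps * sigma_max are
    truncated -- numpy.linalg.pinv's default tolerance."""
    U, s, Vt = np.linalg.svd(Ky)
    cutoff = max(Ky.shape) * np.finfo(s.dtype).eps * s[0]
    d = np.where(s > cutoff, 1.0 / s, 0.0)
    return Vt.T @ (d * (U.T @ rhs))
```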

* fix(regression): poisson irls must not overwrite coefficients with nan/inf

Predictions_ShouldBeFinite and CollinearFeatures_ShouldNotCrash both
failed on net10 because the IRLS step in PoissonRegression.Train can
produce a newCoefficients vector with NaN entries when Xᵀ·W·X is
numerically singular (the solve with QR/SVD doesn't always refuse the
factorization — it sometimes just hands back 1/0 or 0/0). The loop then
assigned those NaN values into Coefficients and Intercept, and every
subsequent PredictMean call propagated NaN through the linear predictor.

Check for non-finite entries before accepting the step and halt
iteration instead, preserving the last known-good coefficients. Matches
statsmodels GLM's "LinAlgError" abort.

Locally PoissonRegressionTests: 20/22 → 21/22 (the remaining
MoreData_ShouldNotDegrade_R2 is a separate convergence issue).
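The accept/halt guard reduces to a few lines; a Python sketch with a hypothetical helper name:

```python
import math

def accept_irls_step(coefficients, new_coefficients):
    """Reject an IRLS update containing NaN/Inf entries: keep the last
    known-good coefficients and signal the caller to halt iteration."""
    if all(math.isfinite(c) for c in new_coefficients):
        return new_coefficients, True
    return coefficients, False
```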

* fix(regression): rbf solve via tikhonov-damped svd instead of normal-equations inverse

RBF design matrices are often severely ill-conditioned — when a handful
of centers end up far from every input, the corresponding columns go to
near-zero and Xᵀ·X has a huge condition number. The previous solve
inverted Xᵀ·X + λI directly via Matrix.Inverse(), which amplified
roundoff into NaN predictions (Predictions_ShouldBeFinite,
SingleFeature_ShouldWork, CollinearFeatures_ShouldNotCrash) and
catastrophic negative R² (R2_ShouldBePositive_OnLinearData saw
R² ≈ -10¹²).

Replace with a Tikhonov-regularized SVD solve on X directly:
  weights = V · diag(σ / (σ² + λ²)) · Uᵀ · y
with λ = 1e-6 · σ_max. This smoothly damps the ill-conditioned
directions instead of zeroing them (which a hard-tolerance pseudoinverse
would, dropping real signal along with roundoff) and avoids forming
the normal-equations matrix that was the source of the explosion.

Locally RbfRegression: NaN predictions cleared, R² on linear data
improved by 11+ orders of magnitude (from ~-10¹² to single-digit
negative). A couple of R²-positivity tests still fail — likely
center-placement / gamma choice, a separate improvement — but the
NaN poisoning is gone.
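The damped-SVD formula above translates directly to NumPy (assuming NumPy; the function name is illustrative):

```python
import numpy as np

def tikhonov_svd_solve(X, y, rel_lambda=1e-6):
    # weights = V . diag(sigma / (sigma^2 + lambda^2)) . U^T . y,
    # with lambda = rel_lambda * sigma_max. Small singular values are
    # damped smoothly rather than truncated or inverted.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    lam = rel_lambda * s[0]
    d = s / (s ** 2 + lam ** 2)
    return Vt.T @ (d * (U.T @ y))
```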

* fix: address 10 CodeRabbit review comments on PR #1156

- AesGcmModelArtifactProtector.SanitizeFileName: reject Windows DOS
  reserved device names (CON/PRN/AUX/NUL/COM1-9/LPT1-9) and trim trailing
  dot/space characters. Previously portable-artifact guarantee failed on
  names like "CON.bin" or "model." — now prefixed with '_' and trimmed so
  artifacts created on POSIX hosts still mount on Windows.
- DiTNoisePredictor.ForwardBlock + FinalLayerWithAdaLN: guard against
  misconfigured AdaLN modulation output sizes. If modulation.Length isn't
  divisible by 6 * _hiddenSize (or 2 * _hiddenSize for final layer),
  throw InvalidOperationException with a clear diagnostic rather than
  letting integer division truncate silently and Engine.Reshape throw a
  cryptic shape-mismatch error downstream.
- RobustFileOpsMoveRetryTests: renamed
  Move_SucceedsAfter_TransientSharingViolation → ...TransientMissingParentDirectory
  and Move_Propagates_WhenLockNeverReleases → ...WhenParentDirectoryNeverCreated
  so the test names match the actual cross-platform retry trigger (missing
  destination parent directory, not lock/share violation which doesn't
  work on Linux). Fixed XML-doc reference from IOException → DirectoryNotFoundException.
- PredictionStats.CalculatePredictionStats: reuse R2 + AdjustedR2 already
  computed eagerly in the constructor with identical inputs, instead of
  recalculating them in the lazy-compute path. Cuts two O(n) scans.
- NeuralNetworkBase: new protected PromoteToBatchedTensor + EnsureBatchForCnnTraining
  helpers. Extracted from the duplicated 4-line rank-3 → rank-4 input
  expansion pattern that ResNet/VGG/MobileNetV2/ConvolutionalNeuralNetwork
  all carried individually. Subclasses' Train() now delegates to the base
  helper and removes their private AddBatchDimension copies.
  (Name differs from per-subclass AddBatchDimension to avoid CS0108
  hides-inherited warnings on 10+ segmentation subclasses that keep their
  own local helpers for non-CNN-training paths.)
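The AdaLN divisibility guard from the second bullet can be sketched in Python (hypothetical helper name; the project code is C# and checks 6 · _hiddenSize, or 2 · _hiddenSize for the final layer):

```python
def split_adaln_modulation(modulation, hidden_size, chunks=6):
    """Reject a modulation vector whose length isn't an exact multiple of
    chunks * hidden_size, instead of letting integer division truncate
    silently and a later reshape fail with a cryptic shape error."""
    expected = chunks * hidden_size
    if len(modulation) % expected != 0:
        raise ValueError(
            f"AdaLN modulation length {len(modulation)} is not divisible "
            f"by {chunks} * hidden_size ({expected})")
    # One row per hidden_size-wide group of modulation values.
    return [modulation[i:i + hidden_size]
            for i in range(0, len(modulation), hidden_size)]
```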

Verify:
- src build net10.0 — 0 errors
- tests build net10.0 — 0 errors
- Tensors 0.46.1 confirmed published on NuGet

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: franklinic <franklin@ivorycloud.com>