marklam/TierProblems

TierProblems

Self-contained .NET console app (multi-targeting net8.0 and net10.0) that reproduces a JIT tier-up miscompilation observed on an experimental fork of PureHDF (branch reflection-cache-2 in marklam/marklam-purehdf), not on upstream Apollo3zehn/PureHDF. The caches described below were prototyped on that fork as potential PRs to PureHDF but were not submitted upstream, because issues, including this JIT interaction, were uncovered during evaluation. Nothing in this repro should be taken as endorsed by or representative of the official PureHDF project.

The repro inlines just enough of the fork's modified read path (NativeAttribute.Read<T> plus a per-instance decoder cache on DatatypeMessage that the fork added) to drive the same call shape against in-memory data. No file I/O, no NuGet package, no project reference — everything needed is in this folder.

What goes wrong

The reflection-cache-2 branch on the experimental PureHDF fork added two ConcurrentDictionary-of-Delegate caches:

  • NativeAttribute._readerCache keyed by (TResult, TElement) — holds a ReaderDelegate<TResult> built via MethodInfo.CreateDelegate(...) and pointed at ReadCoreLevel1_generic<TResult, TElement>.
  • DatatypeMessage._decodeInfoCache keyed by (TElement, isRawMode) — holds a DecodeDelegate<TElement> whose target is a static local function <GetDecodeInfoForUnmanagedMemory>g__decode|N_0[T] built via MakeGenericMethod(...).Invoke(...).

Together the caches mean every Read<T> reuses the same two delegate instances. The static local function called via the inner delegate becomes the permanently hot site for unmanaged reads. Once JIT tier-up promotes that call chain, the runtime intermittently produces bad code. Disabling tier-up (in PowerShell: $env:DOTNET_TieredCompilation = "0") eliminates every symptom, at the cost of slowing the benchmark by ~60%.
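The cache shape described above can be sketched roughly as follows. This is a minimal illustration, not the fork's code: ReaderCacheSketch, ReadCore, and the byte[] source parameter are hypothetical stand-ins, and only the general pattern (a ConcurrentDictionary of Delegate keyed by a type pair, filled via MakeGenericMethod and MethodInfo.CreateDelegate) is taken from the description.

```csharp
using System;
using System.Collections.Concurrent;
using System.Reflection;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the fork's ReaderDelegate<TResult>.
public delegate TResult ReaderDelegate<TResult>(byte[] source);

public static class ReaderCacheSketch
{
    // One cached delegate per (TResult, TElement) pair, stored as Delegate and cast on use.
    private static readonly ConcurrentDictionary<(Type Result, Type Element), Delegate> _readerCache = new();

    public static TResult Read<TResult, TElement>(byte[] source) where TElement : unmanaged
    {
        var reader = (ReaderDelegate<TResult>)_readerCache.GetOrAdd(
            (typeof(TResult), typeof(TElement)),
            static key =>
            {
                // Close the open generic method over the key's types, then bind a typed delegate to it.
                MethodInfo open = typeof(ReaderCacheSketch).GetMethod(
                    nameof(ReadCore), BindingFlags.NonPublic | BindingFlags.Static)!;
                MethodInfo closed = open.MakeGenericMethod(key.Result, key.Element);
                return closed.CreateDelegate(typeof(ReaderDelegate<>).MakeGenericType(key.Result));
            });

        // Every later call with the same type pair reuses this delegate instance.
        return reader(source);
    }

    // Hypothetical simplified read: reinterpret the leading bytes as TElement and box-cast to TResult.
    private static TResult ReadCore<TResult, TElement>(byte[] source) where TElement : unmanaged
    {
        TElement value = MemoryMarshal.Read<TElement>(source);
        return (TResult)(object)value;
    }
}
```

After the first call for a given type pair, every subsequent call goes through the same delegate instance, which is what concentrates all the heat on a single call target.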

The symptom isn't constant. On this box across many runs the same binary has produced at least these distinct failures:

Symptom | Where
OverflowException at MemoryMarshal.AsBytes | checked(span.Length * sizeof(T)) with Length=1, sizeof(T)=12
Exception: total file element count != total memory element count | upstream value corruption (one side reads wrong)
InvalidOperationException from the wrong branch of an is null check | (buffer is null || buffer.Equals(default)) returns false even though buffer is default(TResult)
NullReferenceException deep in RuntimeType.ListBuilder.Add | runtime/reflection internals stepped on
EntryPointNotFoundException at System.IDisposable.Dispose() | method-table corruption visible during teardown
Process abort with exit 0xC0000005 (access violation) | heap corruption GC walks into during teardown

All disappear at DOTNET_TieredCompilation=0. All are observed on the same machine, same SDK, same binary. The repro here surfaces the "wrong branch of is null check" and the deep-runtime NullReferenceException variants reliably; other variants surface when calling PureHDF for real.

What's in this repo

  • Program.cs — driver. Builds four NativeAttribute instances (int, long, double, Sample) with synthetic per-attribute Size, primes each (TResult, TElement) cache entry once, then hot-loops sampleAttribute.Read<Sample>() up to 200M calls and catches whatever the runtime throws.
  • InlinedPureHdf.cs — minimal copy of the read-path code the bug depends on, taken from the experimental fork (not upstream PureHDF):
    • IH5ReadStream, SystemMemoryStream, ArrayMemoryManager<T>, DecodeDelegate<T> — verbatim from the fork's src/PureHDF/....
    • DatatypeMessage — only the compound unmanaged scalar branch, with the _decodeInfoCache and the GetDecodeInfoForUnmanagedMemory<T> static local exactly as on the fork's reflection-cache-2 branch (i.e. with the pre-fork-fix MemoryMarshal.AsBytes(target) call). Other class branches (string, reference, array, variable-length, …) are stubbed out behind a throw.
    • NativeAttribute — the _readerCache, ReadCoreLevel1_generic, ReadCoreLevel2, GetDecoderAndFileElementCount. Identical method names so stack traces line up with the original.
    • DataUtils, WriteUtils, AttributeMessage, DataspaceMessage — only the members the hot path touches.
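The driver's hot loop can be pictured along these lines. This is a hedged sketch only: HotLoopSketch and its Action parameter are hypothetical, the real Program.cs calls sampleAttribute.Read<Sample>() directly, and routing the call through an Action changes the call shape, so this version is not expected to reproduce the bug by itself.

```csharp
#nullable enable
using System;

public static class HotLoopSketch
{
    // Calls `read` up to maxCalls times. Returns the number of calls attempted
    // and the first exception thrown, or null if every call succeeded.
    public static (long Calls, Exception? Error) Run(Action read, long maxCalls)
    {
        long calls = 0;
        try
        {
            // Keep the try/catch outside the loop so the loop body stays as small as possible.
            for (; calls < maxCalls; calls++)
            {
                read();
            }
        }
        catch (Exception ex)
        {
            // `calls` is the zero-based index of the failing call.
            return (calls + 1, ex);
        }

        return (calls, null);
    }
}
```

A caller would report a non-null Error as a hit and a null Error as a clean run, along with the call count.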

Running

cd d:\git\temp\TierProblems
dotnet build -c Release
dotnet run -c Release --no-build --framework net8.0   # default
dotnet run -c Release --no-build --framework net10.0

On a hit:

HIT after 156,419,186 calls in 24.0s
  outer: System.NullReferenceException: Object reference not set to an instance of an object.

System.NullReferenceException: Object reference not set to an instance of an object.
   at System.RuntimeType.ListBuilder`1.Add(T item)
   at TierProblems.Inlined.NativeAttribute.ReadCoreLevel1_generic[...]
   at TierProblems.Inlined.NativeAttribute.Read[T]
   ...

On a clean run:

no hit after 200,000,000 calls in 30.5s

Confirming tier-up is the trigger:

$env:DOTNET_TieredCompilation = "0"
dotnet run -c Release --no-build --framework net8.0   # consistently passes, ~60% slower
Remove-Item Env:\DOTNET_TieredCompilation

Hit rate observed

Measured on a 13th Gen Intel Core i9-13900KS running Windows 11. The project multi-targets net8.0 and net10.0; run either with --framework net8.0 or --framework net10.0.

net8.0 (runtime 8.0.27)

15-run sample, default tier-up:

Outcome | Count
pass | 5
InvalidOperationException (wrong branch of is null check) | 4
NullReferenceException (deep in runtime / RuntimeType.ListBuilder.Add) | 2
Unhandled crash mid-warmup (e.g. get_Message() returns null), outside try/catch | 4

≈ 10/15 failures (~67%). With DOTNET_TieredCompilation=0: 0 hits in 6 runs over many minutes.

net10.0 (runtime 10.0.8)

45-run sample (15 + 30), default tier-up:

Outcome | Count
pass | 42
InvalidOperationException (wrong branch of is null check) | 1
EntryPointNotFoundException at teardown (method-table corruption) | 1
Unhandled NullReferenceException mid-warmup (get_Message() returns null) | 1

≈ 3/45 failures (~7%). The bug is much less frequent on .NET 10 but not fixed — same family of symptoms still appears, and one variant (EntryPointNotFoundException at teardown) is new in this sample vs net8.0.

net10.0 also runs the hot loop noticeably faster (~17s vs ~30s for 200M calls), suggesting the JIT made smarter inlining/codegen decisions overall — but whatever specific path drives the corruption is still reachable.

Why a "shape-only" repro didn't work

An earlier iteration of this repro tried to match the call shape without copying the fork's classes — its own Reader<T>/Decoder<T> delegate caches, an IReadStream interface, a clone of ArrayMemoryManager<T>, the same static local function pattern. That shape-only version ran ~7× faster per call (~35M reads/sec vs the fork's ~5–6M) and never hit the bug at 1B+ calls. The JIT inlined the lean version enough to constant-fold the failing operations away.

Inlining the fork's classes verbatim (with the per-call SystemMemoryStream allocation, the MemoryManager<T> virtual GetSpan(), the ulong[1] memoryDims allocation, the new TElement[1] allocation, and the reflection chain into IsReferenceOrContainsReferences<T>) brings throughput down to ~6M reads/sec, and the bug starts firing.

This suggests the bug is sensitive to how much code is in the tier-up target, not just the abstract call shape. That matches the observed behaviour when linking the fork as a project reference but not calling any of it: tier-up still works correctly, because no method on the fork ever becomes hot.

What this is good for

Anything you'd want a small-but-real reproducer for: filing a dotnet/runtime issue, bisecting against .NET 8.0.x patch releases, testing other runtime versions such as .NET 9 (.NET 10 still reproduces, at a lower rate, as shown above), or experimenting with mitigations (MethodImplOptions.NoOptimization on individual methods, disabling specific tier-up sub-features via DOTNET_TC_* env vars, etc.).
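As an illustration of the per-method mitigation mentioned above, a decode helper could be marked like this. This is a sketch under assumptions: MitigationSketch and Decode are hypothetical names, not the fork's methods, and whether this attribute actually avoids the corruption has not been established here.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class MitigationSketch
{
    // NoOptimization asks the JIT not to optimize this method, which should keep it
    // out of the optimized tier. Untested as a fix for this bug.
    [MethodImpl(MethodImplOptions.NoOptimization)]
    public static void Decode<T>(ReadOnlySpan<byte> source, Span<T> target) where T : unmanaged
    {
        // Copy the raw bytes over the target elements; throws if target is too small.
        source.CopyTo(MemoryMarshal.AsBytes(target));
    }
}
```

Because the attribute applies per method, it can be moved along the hot call chain one method at a time to narrow down which tier-up target matters.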

It is not the fork's own benchmark — that benchmark exists at ..\marklam-purehdf\benchmarks\PureHDF.Benchmarks (still inside the experimental fork, not in upstream PureHDF) and also reproduces the bug, but it pulls in the rest of the codebase plus BenchmarkDotNet.

Environment

  • Windows 11 (10.0.26200), x64
  • 13th Gen Intel Core i9-13900KS
  • .NET SDK 10.0.300
  • Runtimes tested: 8.0.27 (8.0.2726.22922), 10.0.8
  • Server GC

Fork-side workaround for the original OverflowException

A later commit on the same experimental fork (still not in upstream PureHDF) updates src/PureHDF/VOL/Native/FileFormat/Level2/ObjectHeaderMessages/Datatype/DatatypeMessage.Reading.cs so the cached static local no longer calls MemoryMarshal.AsBytes. It constructs the byte span directly:

var byteSpan = MemoryMarshal.CreateSpan(
    ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(target)),
    target.Length * Unsafe.SizeOf<T>());

source.ReadDataset(byteSpan);

That sidesteps the specific checked(...) throw site but does not address the broader JIT corruption — the other symptoms in the table above still occur on the unmodified call paths. It is one of several reasons the cache changes were not submitted as a PR to PureHDF.
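To see that the two constructions cover the same memory when the runtime behaves correctly, a small check can be written as below. Sample12 is a hypothetical 12-byte struct chosen to match the sizeof(T)=12 case from the symptom list; the repro's actual Sample layout is not shown in this document. One difference worth noting: the multiplication in the manual version is unchecked under default C# compiler settings, so it cannot throw OverflowException.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical 12-byte element type (three 4-byte ints, sequential layout).
[StructLayout(LayoutKind.Sequential)]
public struct Sample12
{
    public int A;
    public int B;
    public int C;
}

public static class ByteSpanSketch
{
    // Library route: length is computed internally before the span is reinterpreted.
    public static Span<byte> ViaAsBytes<T>(Span<T> target) where T : unmanaged
        => MemoryMarshal.AsBytes(target);

    // Manual route from the workaround: reinterpret the first element as a byte
    // and compute the byte length directly.
    public static Span<byte> ViaCreateSpan<T>(Span<T> target) where T : unmanaged
        => MemoryMarshal.CreateSpan(
            ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(target)),
            target.Length * Unsafe.SizeOf<T>());

    // Returns true when both routes yield a 12-byte span over the same memory.
    public static bool SameBytes()
    {
        var buffer = new Sample12[1];
        Span<byte> a = ViaAsBytes<Sample12>(buffer);
        Span<byte> b = ViaCreateSpan<Sample12>(buffer);
        return a.Length == 12 && a == b; // Span == compares start reference and length
    }
}
```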
