Consistent base-1000 byte units; rewrite chunk-quality scoring by octogonz · Pull Request #69 · microsoft/monodex

octogonz · 2026-05-22T01:59:12Z

Byte-unit consistency. RAM sizes were a mix of base-1024 and base-1000: auto-tuning planning constants used GiB/MiB arithmetic, while format_bytes divided by base-1024 but printed "GB"/"MB" labels, so displayed RAM figures were mislabelled. Everything now uses base-1000, so labels and values agree and no user-readable surface mentions GiB/MiB. The auto-tuning constants are padded round guesses, so the ~7% shift is well inside the heuristic's existing slop; the instance-count heuristic may move slightly.

Chunk-quality scoring rewrite. The 0-100 score behind the audit-chunks and dump-chunks developer tools was incoherent: well-chunked files could score 0, badly over-split files could score ~99.98, and several internal computations were dead or self-contradictory. It's now a two-dimensional smell score. Size health treats any chunk in the [500, 6000] char band as fine (the partitioner splits at AST boundaries, so a clean mid-band chunk is correct, not half-bad) and penalizes only the two tails. A separate, forgiving count penalty flags files split into far more chunks than the content needs. The two combine multiplicatively. This is a maintainer triage heuristic, not a pass/fail metric.

Replace base-1024 arithmetic with base-1000 in PER_INSTANCE_RAM, BASELINE_RESERVE_MIN, and EMBEDDING_OVERHEAD. Update all comments to use GB instead of GiB. These are conservative planning heuristics, not calibrated figures; the ~7% shift from rebasing is well inside their slop margin.
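To make the "~7%" figure concrete, here is a minimal sketch of the rebasing arithmetic. The helper name `rebase_shrink_percent` is hypothetical and does not come from the PR; only the unit definitions are standard. It shows that a constant written as n GiB shrinks by a little under 7% when rewritten as n GB, and a MiB-based constant shrinks by somewhat less.

```rust
// Standard unit sizes in bytes.
const GB: u64 = 1_000_000_000; // base-1000 gigabyte
const GIB: u64 = 1 << 30; // base-1024 gibibyte
const MB: u64 = 1_000_000; // base-1000 megabyte
const MIB: u64 = 1 << 20; // base-1024 mebibyte

/// Hypothetical helper (not from the PR): percentage by which a constant
/// shrinks when "n binary units" is rewritten as "n decimal units".
fn rebase_shrink_percent(binary_unit: u64, decimal_unit: u64) -> f64 {
    (1.0 - decimal_unit as f64 / binary_unit as f64) * 100.0
}
```

Under these definitions, `rebase_shrink_percent(GIB, GB)` is roughly 6.9 and `rebase_shrink_percent(MIB, MB)` is roughly 4.6, which is consistent with treating the change as noise for padded planning guesses.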

format_bytes displayed GB/MB labels but divided by base-1024 constants, so the printed numbers disagreed with their labels. Fix by moving to base-1000 constants so values and units agree. Also update README.md to use 2.5 GB consistently.
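A minimal sketch of what a base-1000 formatter could look like follows. The signature, the one-decimal precision, and the kB/B branches are assumptions for illustration; the PR only states that format_bytes now divides by base-1000 constants and prints GB/MB labels.

```rust
// Base-1000 units, so the divisor matches the printed label.
const KB: u64 = 1_000;
const MB: u64 = 1_000 * KB;
const GB: u64 = 1_000 * MB;

/// Illustrative formatter (assumed signature and precision, not the
/// project's actual code): picks the largest unit that fits and
/// divides by the same base-1000 constant that the label names.
fn format_bytes(bytes: u64) -> String {
    if bytes >= GB {
        format!("{:.1} GB", bytes as f64 / GB as f64)
    } else if bytes >= MB {
        format!("{:.1} MB", bytes as f64 / MB as f64)
    } else if bytes >= KB {
        format!("{:.1} kB", bytes as f64 / KB as f64)
    } else {
        format!("{} B", bytes)
    }
}
```

With this shape, 2,500,000,000 bytes formats as "2.5 GB". The old mix would have divided by 1,073,741,824 and printed about 2.3 under the same "GB" label.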

The new scoring formula measures two independent dimensions: size badness (penalizing chunks outside the healthy [500, 6000] char band) and count badness (penalizing severe over-splitting). A single whole-file chunk is special-cased and never penalized, so small files don't appear problematic. Update docs/design/chunker.md and README.md with the new guidance. Regenerate partitioner snapshot tests to reflect the new scores.
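The shape of such a two-dimensional score can be sketched as below. This is not the formula from the PR. The linear tail penalties, the `SLACK` factor, the "chunks needed" estimate, and all function names are assumptions chosen for illustration. Only the [500, 6000] band, the multiplicative combination, the 0-100 range, and the single-chunk exemption come from the description.

```rust
const MIN_HEALTHY: usize = 500; // lower edge of the healthy band (chars)
const MAX_HEALTHY: usize = 6000; // upper edge of the healthy band (chars)
const SLACK: usize = 4; // assumed forgiveness factor for chunk count

/// Size badness in [0, 1]: zero anywhere inside the band, growing
/// linearly in either tail (the linear shape is an assumption).
fn size_badness(len: usize) -> f64 {
    if len < MIN_HEALTHY {
        (MIN_HEALTHY - len) as f64 / MIN_HEALTHY as f64
    } else if len > MAX_HEALTHY {
        ((len - MAX_HEALTHY) as f64 / MAX_HEALTHY as f64).min(1.0)
    } else {
        0.0
    }
}

/// Count badness in [0, 1): zero until the file has more than SLACK
/// times the chunks its content would need at the upper band edge.
fn count_badness(total_chars: usize, chunk_count: usize) -> f64 {
    let needed = ((total_chars + MAX_HEALTHY - 1) / MAX_HEALTHY).max(1);
    let allowed = needed * SLACK;
    if chunk_count <= allowed {
        0.0
    } else {
        1.0 - allowed as f64 / chunk_count as f64
    }
}

/// Illustrative 0-100 score: the two health factors multiply, and a
/// single whole-file chunk (or an empty file) is never penalized.
fn chunk_quality_score(chunk_lens: &[usize]) -> f64 {
    if chunk_lens.len() <= 1 {
        return 100.0;
    }
    let n = chunk_lens.len() as f64;
    let mean_size_bad = chunk_lens.iter().map(|&l| size_badness(l)).sum::<f64>() / n;
    let total_chars: usize = chunk_lens.iter().sum();
    let count_bad = count_badness(total_chars, chunk_lens.len());
    100.0 * (1.0 - mean_size_bad) * (1.0 - count_bad)
}
```

In this sketch, a file split into three mid-band chunks scores 100, a tiny single-chunk file also scores 100, and a 2000-char file shredded into forty 50-char chunks scores close to 0. That matches the intended ordering, unlike the old behaviour where over-split files could score ~99.98.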

octogonz added 4 commits May 21, 2026 18:08

test: re-bless partitioner snapshots

208b23d

octogonz merged commit 732d54c into microsoft:main May 22, 2026
2 checks passed

octogonz merged 4 commits into microsoft:main from octogonz:main