Consistent base-1000 byte units; rewrite chunk-quality scoring #69

Merged
octogonz merged 4 commits into microsoft:main from octogonz:main on May 22, 2026

@octogonz
Collaborator

Byte-unit consistency. RAM sizes were a mix of base-1024 and base-1000: auto-tuning planning constants used GiB/MiB arithmetic, while format_bytes divided by base-1024 but printed "GB"/"MB" labels, so displayed RAM figures were mislabelled. Everything now uses base-1000, so labels and values agree and no user-readable surface mentions GiB/MiB. The auto-tuning constants are padded round guesses, so the ~7% shift is well inside the heuristic's existing slop; the instance-count heuristic may move slightly.
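
To illustrate the fix, here is a minimal sketch of a base-1000 formatter. The real `format_bytes` signature, rounding, and unit tiers are not shown in this PR, so the one-decimal output and the `kB` tier below are assumptions.

```python
# Hypothetical sketch: the actual format_bytes in the repo may differ.
KB = 1_000          # base-1000 units, so the printed label matches the divisor
MB = 1_000 * KB
GB = 1_000 * MB


def format_bytes(n: int) -> str:
    """Format a byte count with base-1000 units (assumed one-decimal output)."""
    if n >= GB:
        return f"{n / GB:.1f} GB"
    if n >= MB:
        return f"{n / MB:.1f} MB"
    if n >= KB:
        return f"{n / KB:.1f} kB"
    return f"{n} B"


print(format_bytes(2_500_000_000))
print(format_bytes(1_500_000))
```

For comparison, dividing 2,500,000,000 bytes by 1024**3 gives roughly 2.33, which is the kind of value the old code would have printed under a "GB" label.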

Chunk-quality scoring rewrite. The 0-100 score behind the audit-chunks and dump-chunks developer tools was incoherent: well-chunked files could score 0, badly over-split files could score ~99.98, and several internal computations were dead or self-contradictory. It's now a two-dimensional smell score. Size health treats any chunk in the [500, 6000] char band as fine (the partitioner splits at AST boundaries, so a clean mid-band chunk is correct, not half-bad) and penalizes only the two tails. A separate, forgiving count penalty flags files split into far more chunks than the content needs. The two combine multiplicatively. This is a maintainer triage heuristic, not a pass/fail metric.
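
The shape of the new score can be sketched roughly as follows. Only the [500, 6000] band, the multiplicative combination, the 0-100 range, and the single-chunk special case come from this PR; the linear tail penalties, the `COUNT_TOLERANCE` value, and all function names are assumptions made for illustration, not the merged formula.

```python
import math

HEALTHY_MIN = 500       # lower edge of the healthy band, in chars (from the PR)
HEALTHY_MAX = 6000      # upper edge of the healthy band, in chars (from the PR)
COUNT_TOLERANCE = 8.0   # hypothetical: how much over-splitting is forgiven


def size_health(chunk_lengths):
    """Mean per-chunk health: 1.0 inside the band, decaying in either tail."""
    def one(length):
        if length < HEALTHY_MIN:
            return length / HEALTHY_MIN      # assumed linear penalty, small tail
        if length > HEALTHY_MAX:
            return HEALTHY_MAX / length      # assumed inverse penalty, large tail
        return 1.0
    return sum(one(n) for n in chunk_lengths) / len(chunk_lengths)


def count_health(chunk_lengths):
    """1.0 unless the file has far more chunks than its size would need."""
    needed = max(1, math.ceil(sum(chunk_lengths) / HEALTHY_MAX))
    ratio = len(chunk_lengths) / needed
    return 1.0 if ratio <= COUNT_TOLERANCE else COUNT_TOLERANCE / ratio


def chunk_score(chunk_lengths):
    """0-100 smell score; the two dimensions combine multiplicatively."""
    if len(chunk_lengths) <= 1:
        return 100.0                         # a single whole-file chunk is never penalized
    return 100.0 * size_health(chunk_lengths) * count_health(chunk_lengths)


print(chunk_score([1000, 3000, 5500]))   # all chunks in band
print(chunk_score([50] * 200))           # badly over-split
```

Under these assumptions a file whose chunks all sit in the band scores 100, while one shredded into 200 tiny chunks is penalized on both dimensions and lands near 0.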

octogonz added 4 commits May 21, 2026 18:08
Replace base-1024 arithmetic with base-1000 in PER_INSTANCE_RAM,
BASELINE_RESERVE_MIN, and EMBEDDING_OVERHEAD. Update all comments
to use GB instead of GiB.

These are conservative planning heuristics, not calibrated figures;
the ~7% shift from rebasing is well inside their slop margin.
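
As a quick check on the "~7%" figure, this sketch computes how much a constant shrinks when the same numeral is reinterpreted from a binary unit to a decimal one. The function name is made up for illustration.

```python
GIB = 1024 ** 3
GB = 1000 ** 3
MIB = 1024 ** 2
MB = 1000 ** 2


def shrink_percent(binary_unit: int, decimal_unit: int) -> float:
    """Percent reduction when 'N binary units' becomes 'N decimal units'."""
    return (1 - decimal_unit / binary_unit) * 100


print(f"GiB -> GB: {shrink_percent(GIB, GB):.2f}%")   # about 6.87%
print(f"MiB -> MB: {shrink_percent(MIB, MB):.2f}%")   # about 4.63%
```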

format_bytes displayed GB/MB labels but divided by base-1024 constants,
so the printed numbers disagreed with their labels. Fix by moving
to base-1000 constants so values and units agree.

Also update README.md to use 2.5 GB consistently.

The new scoring formula measures two independent dimensions:
size badness (penalizing chunks outside the healthy [500, 6000] char band)
and count badness (penalizing severe over-splitting). A single
whole-file chunk is special-cased and never penalized, so
small files don't appear problematic.

Update docs/design/chunker.md and README.md with the new guidance.
Regenerate partitioner snapshot tests to reflect the new scores.
@octogonz octogonz merged commit 732d54c into microsoft:main May 22, 2026
2 checks passed