Consistent base-1000 byte units; rewrite chunk-quality scoring #69
Merged
Replace base-1024 arithmetic with base-1000 in PER_INSTANCE_RAM, BASELINE_RESERVE_MIN, and EMBEDDING_OVERHEAD. Update all comments to use GB instead of GiB. These are conservative planning heuristics, not calibrated figures; the ~7% shift from rebasing is well inside their slop margin.
format_bytes displayed GB/MB labels but divided by base-1024 constants, so the printed numbers disagreed with their labels. Fix by moving to base-1000 constants so values and units agree. Also update README.md to use 2.5 GB consistently.
The new scoring formula measures two independent dimensions: size badness (penalizing chunks outside the healthy [500, 6000] character band) and count badness (penalizing severe over-splitting). As a special case, a single whole-file chunk is never penalized, so small files don't appear problematic. Update docs/design/chunker.md and README.md with the new guidance. Regenerate partitioner snapshot tests to reflect the new scores.
Byte-unit consistency. RAM sizes were a mix of base-1024 and base-1000: auto-tuning planning constants used GiB/MiB arithmetic, while format_bytes divided by base-1024 but printed "GB"/"MB" labels, so displayed RAM figures were mislabelled. Everything now uses base-1000, so labels and values agree and no user-readable surface mentions GiB/MiB. The auto-tuning constants are padded round guesses, so the ~7% shift is well inside the heuristic's existing slop; the instance-count heuristic may move slightly.
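To make the mislabelling concrete, here is a minimal Python sketch of the before and after behaviour. It is not the project's code: the function bodies, the `mislabelled_format_bytes` name, the one-decimal rounding, and the plain-bytes fallback below 1 MB are all assumptions for illustration. Only the name `format_bytes`, the "GB"/"MB" labels, and the base-1024 versus base-1000 divisors come from the description above.

```python
# Illustrative sketch only; the real format_bytes may differ in rounding and tiers.
GB = 1000**3   # base-1000 gigabyte
MB = 1000**2   # base-1000 megabyte
GIB = 1024**3  # base-1024 gibibyte, the old divisor


def mislabelled_format_bytes(n: int) -> str:
    """Old behaviour as described: divide by base-1024, print a base-1000 label."""
    return f"{n / GIB:.1f} GB"


def format_bytes(n: int) -> str:
    """New behaviour: the divisor and the label are both base-1000."""
    if n >= GB:
        return f"{n / GB:.1f} GB"
    if n >= MB:
        return f"{n / MB:.1f} MB"
    return f"{n} B"  # assumed fallback for small values


# Relative size of the rebasing shift for GB-scale constants (about 7%).
rebase_shift = GIB / GB - 1
```

For a value of 2,500,000,000 bytes, the old path would print a figure below 2.5 next to a "GB" label, while the base-1000 path prints "2.5 GB". The `rebase_shift` ratio is where the roughly 7% figure for GB-scale constants comes from.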
Chunk-quality scoring rewrite. The 0-100 score behind the audit-chunks and dump-chunks developer tools was incoherent: well-chunked files could score 0, badly over-split files could score ~99.98, and several internal computations were dead or self-contradictory. It's now a two-dimensional smell score. Size health treats any chunk in the [500, 6000] char band as fine (the partitioner splits at AST boundaries, so a clean mid-band chunk is correct, not half-bad) and penalizes only the two tails. A separate, forgiving count penalty flags files split into far more chunks than the content needs. The two combine multiplicatively. This is a maintainer triage heuristic, not a pass/fail metric.
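The shape of the new score can be sketched in Python as follows. This is a hedged illustration, not the merged formula: the linear tail penalties, the `slack` factor of 4, the "chunks needed" estimate, the function names, and the convention that 100 means healthy are all assumptions. What comes from the description is the [500, 6000] character band, penalizing only the two tails, a separate forgiving count penalty, the multiplicative combination, and the rule that a single whole-file chunk is never penalized.

```python
import math

LOW, HIGH = 500, 6000  # healthy chunk-size band, in characters


def size_health(chunk_lengths):
    """Mean per-chunk health: 1.0 anywhere inside the band, lower in either tail."""
    def one(n):
        if LOW <= n <= HIGH:
            return 1.0          # a mid-band chunk is fully healthy
        if n < LOW:
            return n / LOW      # assumed linear penalty for tiny chunks
        return HIGH / n         # assumed inverse penalty for oversized chunks
    return sum(one(n) for n in chunk_lengths) / len(chunk_lengths)


def count_health(chunk_lengths, slack=4.0):
    """Forgiving over-splitting penalty; `slack` is a hypothetical tolerance."""
    total = sum(chunk_lengths)
    needed = max(1, math.ceil(total / HIGH))  # rough minimum number of chunks
    ratio = len(chunk_lengths) / needed
    return 1.0 if ratio <= slack else slack / ratio


def chunk_score(chunk_lengths):
    """0-100 triage score; higher means healthier (an assumed convention)."""
    if not chunk_lengths:
        raise ValueError("no chunks to score")
    if len(chunk_lengths) == 1:
        return 100.0  # a single whole-file chunk is never penalized
    return 100.0 * size_health(chunk_lengths) * count_health(chunk_lengths)
```

Under these assumptions, a file split into three mid-band chunks scores 100, a small file kept as one 300-character chunk also scores 100, and a file shredded into two hundred 50-character chunks scores close to 0, because both factors are small and they multiply.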