fix(loaders): strip git log -p commit metadata before parsing #228
Piping `git log -p` (or `git show -p` of multiple commits) through hunk's pager mode floods stderr with hundreds of `parseLineType: Invalid firstChar` / `processFile: invalid rawLine` warnings from @pierre/diffs.

Root cause: pager mode's `looksLikePatchInput` heuristic accepts the stream because it contains `diff --git` and `@@`, then forwards the whole blob to `parsePatchFiles`. But @pierre/diffs is a strict patch parser: every hunk-body line must start with `+`, `-`, ` ` or `\`. The `commit <sha>`, `Author:`, `Date:`, blank, and 4-space-indented commit message lines that appear between commits all fail that check, and the parser logs each rejection unconditionally (no quiet flag).

Fix in `normalizePatchChangeset` rather than in `pager.ts` so all input paths benefit (a saved `git log -p` fed via `hunk patch file.txt` would have hit the same bug). The new `stripGitLogMetadata`:

- Fast-paths to a no-op when no `^commit [0-9a-f]+` boundary exists, keeping the regular patch path zero-allocation.
- State-machines per line: enters "header" mode on a commit boundary, drops every following line until a patch start (`diff --git `, `--- `, `+++ `).
- Preserves context lines that mention "commit" textually: they start with the diff line-type marker (` `/`+`/`-`), so the `^commit ` regex doesn't match.

Tests cover: single-commit, multi-commit, decorated headers (refs in parens), abbreviated SHAs, --stat blocks, merge metadata, context lines mentioning "commit", trailing diff-less commit, and an end-to-end check that the stripped output round-trips through `parsePatchFiles` with zero `Invalid firstChar` warnings.
Greptile Summary
This PR adds
Confidence Score: 4/5
The core logic is correct and well-tested; the only gap is that SHA-256 repositories will silently bypass stripping and reintroduce the original warning noise.
The state machine is straightforward and the test suite is thorough. The SHA regex upper bound of 40 means the fix does nothing for git repos using SHA-256 object format (64-character hashes), restoring the original bug in that case. There is also a minor duplicate regex literal in the loop body that could be unified with the named constant above it.
src/core/loaders.ts, specifically the two occurrences of
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["patchText input"] --> B["replaceAll CRLF → LF"]
B --> C["stripTerminalControl"]
C --> D["stripGitLogMetadata"]
D --> E{{"commit sha boundary present?"}}
E -- No --> F["return text unchanged (zero-allocation fast-path)"]
E -- Yes --> G["split into lines, inHeader = false"]
G --> H["for each line"]
H --> I{{"line matches ^commit [0-9a-f]{4,40}"}}
I -- Yes --> J["inHeader = true, skip line"]
J --> H
I -- No --> K{{"inHeader?"}}
K -- No --> L["push to output"]
L --> H
K -- Yes --> M{{"starts with diff --git / --- / +++"}}
M -- No --> N["skip line (metadata)"]
N --> H
M -- Yes --> O["inHeader = false, push to output"]
O --> H
H -- done --> P["join lines → clean patch"]
P --> Q["parsePatchFiles"]
F --> Q
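The flow above can be sketched in TypeScript roughly as follows. This is a reconstruction from the flowchart and the PR description, not the code in `src/core/loaders.ts`: the function and constant names follow the PR text, while `PATCH_START_PREFIXES`, the trailing `\b`, and the `{4,64}` range (the value the follow-up commit settles on) are assumptions made for illustration.

```typescript
// Sketch only: reconstructed from the flowchart, not copied from the PR.
// Assumption: a trailing \b terminates the hash; the real pattern may differ.
const COMMIT_BOUNDARY = /^commit [0-9a-f]{4,64}\b/m;

// Hypothetical helper constant: prefixes that mark the start of a patch.
const PATCH_START_PREFIXES = ["diff --git ", "--- ", "+++ "];

function stripGitLogMetadata(text: string): string {
  // Fast path: no commit boundary anywhere, so return the input untouched.
  if (!COMMIT_BOUNDARY.test(text)) return text;

  const output: string[] = [];
  let inHeader = false;

  for (const line of text.split("\n")) {
    if (COMMIT_BOUNDARY.test(line)) {
      // A commit boundary starts a metadata block; drop the line itself.
      inHeader = true;
      continue;
    }
    if (inHeader) {
      // Drop Author:, Date:, message and --stat lines until a patch starts.
      if (!PATCH_START_PREFIXES.some((prefix) => line.startsWith(prefix))) continue;
      inHeader = false;
    }
    output.push(line);
  }
  return output.join("\n");
}
```

Context lines such as ` commit 1a2b3c4` survive because they begin with a diff marker, so the `^commit ` anchor never matches them.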
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
src/core/loaders.ts:86-96
**SHA-256 repositories not covered by regex range**
The pattern `[0-9a-f]{4,40}` caps at 40 hex characters, which covers SHA-1 hashes, but SHA-256 object hashes (used when a repo is initialised with `git init --object-format=sha256`) are 64 hex characters. For any such repo, neither the fast-path `COMMIT_BOUNDARY` test nor the per-line check in the loop will ever match, so the function returns the input unchanged and all `commit <64-char-sha>` boundary lines are fed to `@pierre/diffs`, reproducing the original warning spam. Widening both occurrences of the range to `{4,64}` covers both formats.
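A small check of the two ranges, as a sketch. One assumption matters here: the quoted fragment `[0-9a-f]{4,40}` must be followed by some terminator in the real pattern (a `\b` is used below as a stand-in). Without one, `{4,40}` would still match the first 40 characters of a 64-character hash, and the miss described above would not occur.

```typescript
// Assumption: \b stands in for whatever terminates the hash in the real regex.
const SHA1_ONLY = /^commit [0-9a-f]{4,40}\b/m;
const SHA1_OR_SHA256 = /^commit [0-9a-f]{4,64}\b/m;

// Synthetic hashes of the right lengths; real ones would be mixed hex digits.
const sha1Line = "commit " + "a".repeat(40);
const sha256Line = "commit " + "a".repeat(64);

const narrowMatchesSha1 = SHA1_ONLY.test(sha1Line); // true
const narrowMatchesSha256 = SHA1_ONLY.test(sha256Line); // false: no boundary after 40 chars
const wideMatchesSha1 = SHA1_OR_SHA256.test(sha1Line); // true
const wideMatchesSha256 = SHA1_OR_SHA256.test(sha256Line); // true

// Without a trailing boundary the narrow range matches a 40-char prefix.
const unanchoredMatchesSha256 = /^commit [0-9a-f]{4,40}/m.test(sha256Line); // true
```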
### Issue 2 of 2
src/core/loaders.ts:96
**Duplicate regex literal on every loop iteration**
`COMMIT_BOUNDARY` (line 86) was defined precisely for this test, but a new equivalent regex literal without the `m` flag is constructed inline here. Since each `line` from `split("\n")` is already a single string with no embedded newlines, `^` anchors to the start of the string regardless of the `m` flag, so both regexes behave identically per line. Reusing `COMMIT_BOUNDARY` directly (or extracting a module-level constant) removes the redundancy and makes future maintenance simpler.
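The equivalence claimed above can be checked with a short sketch. The pattern bodies here are illustrative stand-ins rather than the literals from `src/core/loaders.ts`, and `INLINE_LITERAL` is a hypothetical name for the regex written inline in the loop.

```typescript
// Illustrative patterns; only the presence or absence of the m flag differs.
const COMMIT_BOUNDARY = /^commit [0-9a-f]{4,64}/m;
const INLINE_LITERAL = /^commit [0-9a-f]{4,64}/;

const text = "diff --git a/x b/x\ncommit 1a2b3c4\n commit 1a2b3c4";

// On the whole text the flags differ: only m lets ^ match at the second line.
const wholeTextWithM = COMMIT_BOUNDARY.test(text); // true
const wholeTextWithoutM = INLINE_LITERAL.test(text); // false

// Per line (no embedded newlines) the two regexes agree on every line.
const perLineAgrees = text
  .split("\n")
  .every((line) => COMMIT_BOUNDARY.test(line) === INLINE_LITERAL.test(line)); // true
```

Reusing the constant with `test` is safe only because it carries no `g` or `y` flag; with either flag, `test` advances `lastIndex` between calls and results would depend on call order.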
Reviews (1): Last reviewed commit: "fix(loaders): strip git log -p commit me..."
Thanks!
CI format:check was failing on these two files; collapsing the long expressions to oxfmt's preferred single-line form unblocks the pipeline. No behavior change.
Bumps the commit-boundary hex range from {4,40} to {4,64} so repos
initialised with --object-format=sha256 (Git 2.29+) get their git log
metadata stripped instead of leaking back into @pierre/diffs as
parseLineType warnings. Reuses the existing COMMIT_BOUNDARY constant
inside the per-line loop so the pattern lives in one place.
Both points were raised by Greptile on PR modem-dev#228.