Skip to content

fix(loaders): strip git log -p commit metadata before parsing#228

Merged
benvinegar merged 4 commits intomodem-dev:mainfrom
gonzaloserrano:fix/strip-git-log-metadata
May 8, 2026
Merged

fix(loaders): strip git log -p commit metadata before parsing#228
benvinegar merged 4 commits intomodem-dev:mainfrom
gonzaloserrano:fix/strip-git-log-metadata

Conversation

@gonzaloserrano
Copy link
Copy Markdown
Contributor

Piping git log -p (or git show -p of multiple commits) through hunk's pager mode floods stderr with hundreds of
parseLineType: Invalid firstChar / processFile: invalid rawLine warnings from @pierre/diffs.

Root cause: pager mode's looksLikePatchInput heuristic accepts the stream because it contains diff --git and @@, then forwards the whole blob to parsePatchFiles. But @pierre/diffs is a strict patch parser: every hunk-body line must start with +, -, or \. The commit <sha>, Author:, Date:, blank, and 4-space-indented commit message lines that appear between commits all fail that check, and the parser logs each rejection unconditionally (no quiet flag).

Fix in normalizePatchChangeset rather than in pager.ts so all input paths benefit (a saved git log -p fed via hunk patch file.txt would have hit the same bug). The new stripGitLogMetadata:

  • Fast-paths to a no-op when no ^commit [0-9a-f]+ boundary exists, keeping the regular patch path zero-allocation.
  • State-machines per line: enters "header" mode on a commit boundary, drops every following line until a patch start (diff --git , --- , +++ ).
  • Preserves context lines that mention "commit" textually — they start with the diff line-type marker ( /+/-), so the ^commit regex doesn't match.

Tests cover: single-commit, multi-commit, decorated headers (refs in parens), abbreviated SHAs, --stat blocks, merge metadata, context lines mentioning "commit", trailing diff-less commit, and an end-to-end check that the stripped output round-trips through parsePatchFiles with zero Invalid firstChar warnings.

Piping `git log -p` (or `git show -p` of multiple commits) through
hunk's pager mode floods stderr with hundreds of
`parseLineType: Invalid firstChar` / `processFile: invalid rawLine`
warnings from @pierre/diffs.

Root cause: pager mode's `looksLikePatchInput` heuristic accepts the
stream because it contains `diff --git` and `@@`, then forwards the
whole blob to `parsePatchFiles`. But @pierre/diffs is a strict patch
parser: every hunk-body line must start with `+`, `-`, ` ` or `\`. The
`commit <sha>`, `Author:`, `Date:`, blank, and 4-space-indented commit
message lines that appear between commits all fail that check, and the
parser logs each rejection unconditionally (no quiet flag).

Fix in `normalizePatchChangeset` rather than in `pager.ts` so all input
paths benefit (a saved `git log -p` fed via `hunk patch file.txt` would
have hit the same bug). The new `stripGitLogMetadata`:

- Fast-paths to a no-op when no `^commit [0-9a-f]+` boundary exists,
  keeping the regular patch path zero-allocation.
- State-machines per line: enters "header" mode on a commit boundary,
  drops every following line until a patch start (`diff --git `,
  `--- `, `+++ `).
- Preserves context lines that mention "commit" textually — they start
  with the diff line-type marker (` `/`+`/`-`), so the `^commit `
  regex doesn't match.

Tests cover: single-commit, multi-commit, decorated headers
(refs in parens), abbreviated SHAs, --stat blocks, merge metadata,
context lines mentioning "commit", trailing diff-less commit, and an
end-to-end check that the stripped output round-trips through
`parsePatchFiles` with zero `Invalid firstChar` warnings.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR adds stripGitLogMetadata to loaders.ts to strip the per-commit envelope lines (commit <sha>, Author:, Date:, indented message body) that git log -p emits between patch hunks, so the remaining text is a clean patch stream that @pierre/diffs can parse without flooding stderr with parseLineType: Invalid firstChar warnings. The fix is placed in normalizePatchChangeset so it applies to every input path (pager mode and direct file input alike).

  • stripGitLogMetadata fast-paths to a no-op when no commit <sha> boundary is detected, keeping the hot path zero-allocation for ordinary patch input.
  • The state machine exits header mode on the first diff --git, ---, or +++ line and does not disturb hunk-body lines whose leading +/-/ already prevents the ^commit regex from matching.
  • A comprehensive test suite covers single-commit, multi-commit, decorated refs, abbreviated SHAs, --stat blocks, merge metadata, context lines containing the word "commit", and an end-to-end round-trip through parsePatchFiles asserting zero noisy warnings.

Confidence Score: 4/5

The core logic is correct and well-tested; the only gap is that SHA-256 repositories will silently bypass stripping and reintroduce the original warning noise.

The state machine is straightforward and the test suite is thorough. The SHA regex upper bound of 40 means the fix does nothing for git repos using SHA-256 object format (64-character hashes), restoring the original bug in that case. There is also a minor duplicate regex literal in the loop body that could be unified with the named constant above it.

src/core/loaders.ts — specifically the two occurrences of {4,40} in stripGitLogMetadata.

Important Files Changed

Filename Overview
src/core/loaders.ts Adds stripGitLogMetadata to remove git log -p commit boundary lines before patch parsing; the SHA regex range {4,40} won't match 64-char SHA-256 hashes, and the same regex is duplicated in the loop body.
src/core/loaders.gitLog.test.ts New test suite covering strip-metadata scenarios including single/multi-commit, decorated headers, abbreviated SHAs, --stat blocks, merge metadata, context lines, trailing diff-less commits, and an end-to-end round-trip through @pierre/diffs.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["patchText input"] --> B["replaceAll CRLF → LF"]
    B --> C["stripTerminalControl"]
    C --> D["stripGitLogMetadata"]
    D --> E{{"commit sha boundary present?"}}
    E -- No --> F["return text unchanged (zero-allocation fast-path)"]
    E -- Yes --> G["split into lines, inHeader = false"]
    G --> H["for each line"]
    H --> I{{"line matches ^commit [0-9a-f]{4,40}"}}
    I -- Yes --> J["inHeader = true, skip line"]
    J --> H
    I -- No --> K{{"inHeader?"}}
    K -- No --> L["push to output"]
    L --> H
    K -- Yes --> M{{"starts with diff --git / --- / +++"}}
    M -- No --> N["skip line (metadata)"]
    N --> H
    M -- Yes --> O["inHeader = false, push to output"]
    O --> H
    H -- done --> P["join lines → clean patch"]
    P --> Q["parsePatchFiles"]
    F --> Q
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
src/core/loaders.ts:86-96
**SHA-256 repositories not covered by regex range**

The pattern `[0-9a-f]{4,40}` caps at 40 hex characters, which covers SHA-1 hashes, but SHA-256 object hashes (used when a repo is initialised with `git init --object-format=sha256`) are 64 hex characters. For any such repo, neither the fast-path `COMMIT_BOUNDARY` test nor the per-line check in the loop will ever match, so the function returns the input unchanged and all `commit <64-char-sha>` boundary lines are fed to `@pierre/diffs`, reproducing the original warning spam. Widening both occurrences of the range to `{4,64}` covers both formats.

### Issue 2 of 2
src/core/loaders.ts:96
**Duplicate regex literal on every loop iteration**

`COMMIT_BOUNDARY` (line 86) was defined precisely for this test, but a new equivalent regex literal without the `m` flag is constructed inline here. Since each `line` from `split("\n")` is already a single string with no embedded newlines, `^` anchors to the start of the string regardless of the `m` flag, so both regexes behave identically per line. Reusing `COMMIT_BOUNDARY` directly (or extracting a module-level constant) removes the redundancy and makes future maintenance simpler.

Reviews (1): Last reviewed commit: "fix(loaders): strip git log -p commit me..." | Re-trigger Greptile

Comment thread src/core/loaders.ts Outdated
Comment thread src/core/loaders.ts Outdated
@benvinegar
Copy link
Copy Markdown
Member

Thanks!

gonzaloserrano and others added 3 commits May 8, 2026 12:35
CI format:check was failing on these two files; collapsing the long
expressions to oxfmt's preferred single-line form unblocks the pipeline.
No behavior change.
Bumps the commit-boundary hex range from {4,40} to {4,64} so repos
initialised with --object-format=sha256 (Git 2.29+) get their git log
metadata stripped instead of leaking back into @pierre/diffs as
parseLineType warnings. Reuses the existing COMMIT_BOUNDARY constant
inside the per-line loop so the pattern lives in one place.

Both points were raised by Greptile on PR modem-dev#228.
@benvinegar benvinegar merged commit d991180 into modem-dev:main May 8, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants