feat(repo stats): dedup commits by git patch-id #40
Merged
kyle-rader merged 1 commit on May 22, 2026
Add patch-id based deduplication to `lk repo stats` so logically-identical commits (e.g. rebased / cherry-picked / cross-repo-migrated history) are counted once instead of inflating contributor totals.

For each `git patch-id --stable` group, the commit with the earliest author date wins (SHA breaks ties for determinism). Commits with no patch (empty diffs, merge commits) are always treated as unique.

The summary line surfaces the count of collapsed duplicates:

    Total commits: 4471 (838 duplicate patches collapsed; --no-dedup to disable)

Pass `--no-dedup` to restore the pre-2.5.0 behavior of counting every SHA individually.

Validated against a real cross-repo migration: in the Microsoft Agency repo (where the ESAI client code was merged in, leaving many duplicate patches) the top contributor list shifts from being dominated by whoever happened to drive the migration to correctly reflecting who actually wrote the most code.

Bumps the crate version 2.4.0 -> 2.5.0 per AGENTS.md (backward-compatible feature).
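For readers who want the selection rule in concrete form, here is a minimal sketch of how the grouping described above could work. It is not the code from this PR: the `RawCommit` struct, its field types, and the exact signature of `dedup_commits` are assumptions made for illustration, as is the `sha -> patch-id` map passed in as input.

```rust
use std::collections::HashMap;

/// One buffered commit. Field names and types are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
struct RawCommit {
    sha: String,
    ts: i64, // author date as a Unix timestamp
    name: String,
    email: String,
}

/// Collapse commits that share a patch-id.
///
/// Within a group, the earliest author date wins and the lexicographically
/// smallest SHA breaks ties. Commits missing from `patch_ids` (empty diffs,
/// merge commits) are keyed by their own SHA, so they always stay unique.
///
/// Returns the winners sorted by (ts, sha) and the number of commits dropped.
fn dedup_commits(
    commits: Vec<RawCommit>,
    patch_ids: &HashMap<String, String>, // sha -> patch-id (assumed input shape)
) -> (Vec<RawCommit>, usize) {
    let total = commits.len();
    let mut winners: HashMap<String, RawCommit> = HashMap::new();

    for c in commits {
        // Group key: the patch-id when known, otherwise the commit's own SHA.
        let key = patch_ids
            .get(&c.sha)
            .cloned()
            .unwrap_or_else(|| c.sha.clone());

        // Replace the current winner only if this commit sorts strictly earlier.
        let replace = match winners.get(&key) {
            Some(w) => (c.ts, c.sha.as_str()) < (w.ts, w.sha.as_str()),
            None => true,
        };
        if replace {
            winners.insert(key, c);
        }
    }

    // Sort so the result does not depend on HashMap iteration order.
    let mut kept: Vec<RawCommit> = winners.into_values().collect();
    kept.sort_by(|a, b| (a.ts, &a.sha).cmp(&(b.ts, &b.sha)));

    let collapsed = total - kept.len();
    (kept, collapsed)
}
```

Comparing `(ts, sha)` tuples keeps the tie-break in one place, and sorting the winners at the end makes the output deterministic regardless of input order. Whether the real implementation sorts its output is not stated in this PR.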
Problem
`lk repo stats` inflates contributor counts when a repo has logically-identical commits with different SHAs. The typical case is a cross-repo history migration (rebase, cherry-pick, or graft merge), but it also shows up for any team that rebases or cherry-picks heavily.

Case study: Microsoft Agency repo
The Agency repo ingested the ESAI client code via a true `git merge` that preserved both parents, but a slice of the same ESAI history had also been synced in earlier as rebased copies. Git correctly attributes each SHA to its original author, but the same logical patch gets counted twice.

Before this PR:
`lk repo stats --all`

After:
`lk repo stats --all`
`lk repo stats --all --no-dedup`

Solution
Add patch-id deduplication, on by default, with `--no-dedup` to opt out.

For each set of commits sharing a `git patch-id --stable` value, keep the commit with the earliest author date (SHA breaks ties). Commits with no patch output (empty diffs, merge commits) are always treated as unique.

The summary line surfaces the count: `Total commits: 4471 (838 duplicate patches collapsed; --no-dedup to disable)`

When `--no-dedup` is passed, the suffix is omitted.

Implementation
Refactored `repo_stats()` from a one-pass tally-during-`git log` loop into collect-then-tally. The new helpers:
- `collect_raw_commits()`: buffers `(sha, ts, name, email)` from `git log`.
- `compute_patch_ids()`: spawns `git diff-tree --stdin -p | git patch-id --stable` with a writer thread feeding SHAs and a reader collecting `<patch-id> <sha>` pairs (avoids the classic pipe-buffer deadlock).
- `dedup_commits()`: groups by patch-id (or own SHA for unmapped commits) and picks the earliest winner per group.

Filtering, the active-window logic, and author canonicalization all run on the deduped winner list, so existing flags (`--days`, `--from`, `--top`, `--name`, `--email`, `--all`/`--first-parent`) keep working as expected.

Tests
Added 5 unit tests for `dedup_commits()` covering: no patch-ids, basic collapse, timestamp ties (SHA tiebreak), unmapped commits remaining unique, and mixed mapped+unmapped. Full `cargo test` suite (38 tests) passes.

Compat
Backward-compatible feature addition: minor version bump 2.4.0 → 2.5.0 per `AGENTS.md`. Existing users who want the raw SHA counts can pass `--no-dedup`.

Performance
On the full Agency repo (~5k commits, no time filter) the dedup pipeline runs in ~3s; typical time-windowed queries are sub-second. The existing `ProgressMeter` covers it.

Out of scope (potential follow-ups)
- `--show-duplicates` diagnostic listing per-author duplicate counts
- `.mailmap` integration
- `--since-ref` to skip dedup before a given graft point