feat(repo stats): dedup commits by git patch-id#40

Merged
kyle-rader merged 1 commit into
kyle-rader:mainfrom
kyle-rader-msft:user/kyrader/repo-stats-patch-id-dedup
May 22, 2026
@kyle-rader-msft
Contributor

Problem

lk repo stats inflates contributor counts when a repo contains logically identical commits with different SHAs. The typical case is a cross-repo history migration (rebase, cherry-pick, or graft merge), but the same inflation shows up for any team that rebases or cherry-picks heavily.

Case study: Microsoft Agency repo

The Agency repo ingested the ESAI client code via a true git merge that preserved both parents — but a slice of the same ESAI history had also been synced in earlier as rebased copies. Git correctly attributes each SHA to its original author, but the same logical patch gets counted twice. Before this PR:

| Mode | Total commits | Top contributor |
|---|---|---|
| lk repo stats --all | 5,309 | Kyle Rader: 704 (massively inflated; he drove the migration) |

After:

| Mode | Total commits | Top contributor |
|---|---|---|
| lk repo stats --all | 4,471 (838 collapsed) | Hardik Kumari: 378 (correct) |
| lk repo stats --all --no-dedup | 5,309 | Kyle Rader: 704 (old behavior) |

Solution

Add patch-id deduplication, on by default, with --no-dedup to opt out.

For each set of commits sharing a git patch-id --stable, keep the commit with the earliest author date (SHA breaks ties). Commits with no patch output (empty diffs, merge commits) are always treated as unique.
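The selection rule above can be sketched as follows. This is a minimal illustration, not the PR's code: the `RawCommit` struct, the function name `collapse_by_patch_id`, the SHA-to-patch-id map, and the returned tuple are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// One buffered commit. Field names are illustrative, not taken from the PR.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
struct RawCommit {
    sha: String,
    ts: i64, // author date, seconds since the Unix epoch
    name: String,
    email: String,
}

/// Collapse commits that share a patch-id. Within a group the earliest author
/// date wins; equal dates fall back to the lexicographically smaller SHA.
/// Returns the surviving commits and how many duplicates were dropped.
fn collapse_by_patch_id(
    commits: Vec<RawCommit>,
    patch_ids: &HashMap<String, String>, // hypothetical map: commit SHA -> patch-id
) -> (Vec<RawCommit>, usize) {
    let total = commits.len();
    let mut winners: HashMap<String, RawCommit> = HashMap::new();
    for c in commits {
        // A commit with no patch-id (merge, empty diff) is keyed by its own
        // SHA, so it can never be grouped with another commit.
        let key = match patch_ids.get(&c.sha) {
            Some(pid) => format!("p:{}", pid),
            None => format!("s:{}", c.sha),
        };
        let replace = match winners.get(&key) {
            Some(best) => (c.ts, c.sha.as_str()) < (best.ts, best.sha.as_str()),
            None => true,
        };
        if replace {
            winners.insert(key, c);
        }
    }
    let mut kept: Vec<RawCommit> = winners.into_values().collect();
    // Sort so the result does not depend on HashMap iteration order.
    kept.sort_by(|a, b| (a.ts, &a.sha).cmp(&(b.ts, &b.sha)));
    let collapsed = total - kept.len();
    (kept, collapsed)
}
```

Comparing `(ts, sha)` tuples makes the tiebreak explicit, and the final sort keeps the output deterministic across runs.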

The summary line surfaces the count:

Total commits: 4471 (838 duplicate patches collapsed; --no-dedup to disable)

When --no-dedup is passed, the suffix is omitted.
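A small sketch of how such a summary line could be built. The function name and parameters are hypothetical, and so is printing the suffix when dedup is on but nothing was collapsed; the PR does not say what happens in that case.

```rust
/// Build the summary line. With dedup disabled the suffix is omitted.
/// Assumption: the suffix is printed whenever dedup is enabled, even if
/// `collapsed` is zero.
fn summary_line(total: usize, collapsed: usize, dedup_enabled: bool) -> String {
    if dedup_enabled {
        format!(
            "Total commits: {} ({} duplicate patches collapsed; --no-dedup to disable)",
            total, collapsed
        )
    } else {
        format!("Total commits: {}", total)
    }
}
```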

Implementation

Refactored repo_stats() from a single pass that tallied while reading git log into a collect-then-tally pipeline. The new helpers:

  • collect_raw_commits() — buffers (sha, ts, name, email) from git log.
  • compute_patch_ids() — spawns git diff-tree --stdin -p | git patch-id --stable with a writer thread feeding SHAs and a reader collecting <patch-id> <sha> pairs (avoids the classic pipe-buffer deadlock).
  • dedup_commits() — groups by patch-id (or own SHA for unmapped commits) and picks the earliest winner per group.
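The writer-thread arrangement described for compute_patch_ids() might look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: the signature, the `repo_dir` parameter, the `parse_patch_id_line` helper, and the error handling are all invented for the example. Only the `git diff-tree --stdin -p | git patch-id --stable` pipeline comes from the description.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, BufReader, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Parse one `git patch-id` output line of the form "<patch-id> <commit-sha>".
/// Returns (sha, patch_id), or None for blank or malformed lines.
fn parse_patch_id_line(line: &str) -> Option<(String, String)> {
    let mut parts = line.split_whitespace();
    let patch_id = parts.next()?;
    let sha = parts.next()?;
    Some((sha.to_string(), patch_id.to_string()))
}

/// Map commit SHA -> patch-id for the given SHAs (hypothetical signature).
fn compute_patch_ids(repo_dir: &str, shas: Vec<String>) -> io::Result<HashMap<String, String>> {
    let mut diff_tree = Command::new("git")
        .args(["-C", repo_dir, "diff-tree", "--stdin", "-p"])
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    let mut diff_in = diff_tree.stdin.take().expect("stdin was piped");
    let diff_out = diff_tree.stdout.take().expect("stdout was piped");

    // Connect diff-tree's stdout directly to patch-id's stdin.
    let mut patch_id = Command::new("git")
        .args(["-C", repo_dir, "patch-id", "--stable"])
        .stdin(Stdio::from(diff_out))
        .stdout(Stdio::piped())
        .spawn()?;
    let ids_out = patch_id.stdout.take().expect("stdout was piped");

    // Feed SHAs from a separate thread. Writing them all on this thread
    // before reading could block once the pipe buffers fill, because nothing
    // would be draining patch-id's output.
    let writer = thread::spawn(move || -> io::Result<()> {
        for sha in &shas {
            writeln!(diff_in, "{}", sha)?;
        }
        Ok(()) // diff_in is dropped here, which signals EOF to diff-tree
    });

    let mut map = HashMap::new();
    for line in BufReader::new(ids_out).lines() {
        if let Some((sha, pid)) = parse_patch_id_line(&line?) {
            map.insert(sha, pid);
        }
    }
    writer.join().expect("writer thread panicked")?;
    diff_tree.wait()?;
    patch_id.wait()?;
    Ok(map)
}
```

Commits for which the pipeline emits no line simply never appear in the map, which matches the "unmapped commits stay unique" rule.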

Filtering, the active-window logic, and author canonicalization all run on the deduped winner list, so existing flags (--days, --from, --top, --name, --email, --all / --first-parent) keep working as expected.
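To illustrate why tallying after dedup keeps the flags composable, here is a hypothetical tally step over the winner list. The `tally` function, its `(timestamp, email)` input shape, and the lowercase-email canonicalization are stand-ins chosen for the example; the actual filtering and canonicalization logic is not shown in this PR description.

```rust
use std::collections::HashMap;

/// Count surviving commits per author at or after `since` (Unix seconds) and
/// return the `top` busiest authors, with ties broken alphabetically.
fn tally(winners: &[(i64, String)], since: i64, top: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for (ts, email) in winners {
        if *ts >= since {
            // Lowercasing stands in for real author canonicalization.
            *counts.entry(email.to_lowercase()).or_insert(0) += 1;
        }
    }
    let mut rows: Vec<(String, usize)> = counts.into_iter().collect();
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    rows.truncate(top);
    rows
}
```

Because the input is already deduplicated, a time window or a top-N cut only ever sees one commit per logical patch.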

Tests

Added 5 unit tests for dedup_commits() covering: no patch-ids, basic collapse, timestamp ties (SHA tiebreak), unmapped commits remaining unique, and mixed mapped+unmapped. Full cargo test suite (38 tests) passes.

Compat

Backward-compatible feature addition — minor version bump 2.4.0 → 2.5.0 per AGENTS.md.

Existing users who want the raw SHA counts can pass --no-dedup.

Performance

On the full Agency repo (~5k commits, no time filter), the dedup pipeline runs in ~3s; typical time-windowed queries are sub-second. The existing ProgressMeter covers the extra work.

Out of scope (potential follow-ups)

  • A --show-duplicates diagnostic listing per-author duplicate counts
  • .mailmap integration
  • --since-ref to skip dedup before a given graft point

Add patch-id based deduplication to `lk repo stats` so logically-identical
commits (e.g. rebased / cherry-picked / cross-repo-migrated history) are
counted once instead of inflating contributor totals.

For each `git patch-id --stable` group, the commit with the earliest
author date wins (SHA breaks ties for determinism). Commits with no
patch (empty diffs, merge commits) are always treated as unique.

The summary line surfaces the count of collapsed duplicates:

  Total commits: 4471 (838 duplicate patches collapsed; --no-dedup to disable)

Pass `--no-dedup` to restore the pre-2.5.0 behavior of counting every SHA
individually.

Validated against a real cross-repo migration: in the Microsoft Agency
repo (where the ESAI client code was merged in, leaving many duplicate
patches) the top contributor list shifts from being dominated by whoever
happened to drive the migration to correctly reflecting who actually
wrote the most code.

Bumps the crate version 2.4.0 -> 2.5.0 per AGENTS.md (backward-compatible
feature).
@kyle-rader kyle-rader merged commit 088c8b6 into kyle-rader:main May 22, 2026
3 checks passed