[STORY] Optimized Commit Retrieval - Single Git Call Per Commit #471

@jsbattig

Description

User Story: Optimized Commit Retrieval

As a developer using temporal git history indexing
I want commit data retrieval to use single batched git operations
So that indexing completes 2-3x faster, reducing wait time from 50 minutes to 15-20 minutes

Story Context

Current Performance:

  • 10-12 git subprocess calls per commit (330ms git overhead)
  • Throughput: 4.5 files/s, 35 KB/s
  • Large repo (82K files): 50+ minutes to index

Target Performance:

  • 1 git call per commit using git show --format="" commit (33ms overhead)
  • Throughput: 10-12 files/s, 70-90 KB/s
  • Large repo: 15-20 minutes to index

Impact:

  • 10x reduction in git operations
  • 40% overall speedup
  • 2-3x faster temporal indexing

Implementation Components

  • Single git call implementation in TemporalDiffScanner
  • Unified diff output parser
  • File type detection from diff headers
  • Blob hash extraction from index lines
  • Binary file detection from diff output
  • Override filtering integration preserved
  • Unit tests (0/0 passing)
  • Integration tests (0/0 passing)
  • Performance tests (0/0 passing)
  • E2E manual testing completed by Claude Code

Completion: 0/10 tasks complete (0%)
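The first component, the single git call, could be sketched as below. This is a minimal sketch, not the actual TemporalDiffScanner code; the `--no-color` flag is an assumption added here to keep the output machine-parseable, and the helper names are illustrative.

```python
import subprocess

def git_show_command(commit_hash):
    # --format= suppresses the commit-message header so only the diff remains
    return ["git", "show", "--format=", "--no-color", commit_hash]

def get_commit_diff_output(commit_hash, repo_path="."):
    # One subprocess call per commit replaces the previous 10-12 calls
    result = subprocess.run(
        git_show_command(commit_hash),
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout
```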

Algorithm

TemporalDiffScanner:
  Data structures:
    - diffs: List[DiffInfo]  # Output collection
    - current_file: FileInfo  # Current file being parsed
    - override_filter_service: OverrideFilterService

  get_diffs_for_commit(commit_hash):
    # PHASE 1: Single git call to get all changes
    output = subprocess.run(["git", "show", "--format=", commit_hash], capture_output=True, text=True).stdout
    
    # PHASE 2: Parse unified diff output
    diffs = []
    current_file = None
    diff_content_lines = []
    
    FOR each line IN output.split("\n"):
      IF line starts with "diff --git":
        # New file starting - save previous file if exists
        IF current_file is not None:
          IF should_include_file(current_file.path):
            diff = create_diff_info(current_file, diff_content_lines)
            diffs.append(diff)
        
        # Parse file paths from: diff --git a/path b/path
        current_file = parse_diff_header(line)
        diff_content_lines = []
      
      ELSE IF line starts with "new file mode":
        current_file.type = "added"
      
      ELSE IF line starts with "deleted file mode":
        current_file.type = "deleted"
      
      ELSE IF line starts with "rename from":
        current_file.type = "renamed"
        current_file.old_path = extract_path(line)
      
      ELSE IF line starts with "index ":
        # Extract blob hashes: index abc123..def456
        current_file.old_blob, current_file.new_blob = parse_index_line(line)
      
      ELSE IF line starts with "Binary files":
        current_file.is_binary = True
      
      ELSE IF line starts with ("---", "+++", "@@", "+", "-", " "):
        # Diff content - accumulate
        diff_content_lines.append(line)
    
    # Don't forget last file
    IF current_file is not None AND should_include_file(current_file.path):
      diff = create_diff_info(current_file, diff_content_lines)
      diffs.append(diff)
    
    RETURN diffs

  create_diff_info(file_info, diff_lines):
    # Determine diff type
    IF file_info.is_binary:
      diff_type = "binary"
      diff_content = f"Binary file {file_info.type}: {file_info.path}"
    
    ELSE IF file_info.type == "renamed" AND no content changes:
      diff_type = "renamed"
      diff_content = f"File renamed from {file_info.old_path} to {file_info.path}"
    
    ELSE:
      diff_type = file_info.type  # "added", "deleted", "modified"
      diff_content = "\n".join(diff_lines)
    
    RETURN DiffInfo(
      file_path = file_info.path,
      diff_type = diff_type,
      diff_content = diff_content,
      blob_hash = file_info.new_blob OR file_info.old_blob,
      old_path = file_info.old_path,
      parent_commit_hash = get_parent_if_needed(file_info)
    )

Key Algorithm Features:

  1. Single git call gets all changes in unified diff format
  2. State machine parsing (track current file, accumulate diff lines)
  3. Header detection determines file type (added/deleted/modified/renamed/binary)
  4. Blob hash extraction from "index" line
  5. Override filtering applied per file
  6. Preserves all existing metadata and deduplication logic
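The state machine above can be fleshed out as a self-contained sketch. Names such as `DiffInfo` and `parse_unified_diff` are illustrative rather than the real scanner's API, and the greedy header regex would need refinement for paths containing spaces.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DiffInfo:
    file_path: str
    diff_type: str = "modified"
    old_path: Optional[str] = None
    old_blob: Optional[str] = None
    new_blob: Optional[str] = None
    is_binary: bool = False
    content_lines: List[str] = field(default_factory=list)

HEADER_RE = re.compile(r"^diff --git a/(.+) b/(.+)$")  # naive for paths with spaces
INDEX_RE = re.compile(r"^index ([0-9a-f]+)\.\.([0-9a-f]+)")

def parse_unified_diff(output: str) -> List[DiffInfo]:
    diffs: List[DiffInfo] = []
    current: Optional[DiffInfo] = None
    for line in output.split("\n"):
        header = HEADER_RE.match(line)
        if header:
            if current is not None:
                diffs.append(current)           # flush the previous file
            current = DiffInfo(file_path=header.group(2))
        elif current is None:
            continue                            # preamble before the first header
        elif line.startswith("new file mode"):
            current.diff_type = "added"
        elif line.startswith("deleted file mode"):
            current.diff_type = "deleted"
        elif line.startswith("rename from "):
            current.diff_type = "renamed"
            current.old_path = line[len("rename from "):]
        elif line.startswith("index "):
            m = INDEX_RE.match(line)            # index <old>..<new> [mode]
            if m:
                current.old_blob, current.new_blob = m.group(1), m.group(2)
        elif line.startswith("Binary files"):
            current.is_binary = True
        elif line.startswith(("---", "+++", "@@", "+", "-", " ")):
            current.content_lines.append(line)  # accumulate diff content
    if current is not None:
        diffs.append(current)                   # don't forget the last file
    return diffs
```

Override filtering and `create_diff_info` would plug in at the two flush points, exactly as in the pseudocode.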

Acceptance Criteria

Scenario 1: Single git call retrieves all commit changes
  Given a commit with 10 files (4 added, 3 modified, 2 deleted, 1 renamed)
  When get_diffs_for_commit() is called
  Then exactly 1 git subprocess call is made
  And the call is "git show --format= <commit_hash>"
  And all 10 files are returned in diffs list
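Scenario 1 can be checked without a real repository by mocking the subprocess layer. The `get_diffs_for_commit` below is a hypothetical stand-in for the scanner method (it only counts file headers), patched at the module level:

```python
import subprocess
from unittest.mock import patch

SAMPLE_DIFF = (
    "diff --git a/foo.py b/foo.py\n"
    "index 1111111..2222222 100644\n"
    "--- a/foo.py\n"
    "+++ b/foo.py\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
)

def get_diffs_for_commit(commit_hash):
    # Stand-in for TemporalDiffScanner.get_diffs_for_commit: one git call, then parse
    out = subprocess.run(
        ["git", "show", "--format=", commit_hash],
        capture_output=True, text=True,
    ).stdout
    return [l for l in out.splitlines() if l.startswith("diff --git")]

def test_single_git_call():
    with patch("subprocess.run") as mock_run:
        mock_run.return_value.stdout = SAMPLE_DIFF
        files = get_diffs_for_commit("abc123")
    assert mock_run.call_count == 1                 # exactly one git subprocess call
    argv = mock_run.call_args.args[0]
    assert argv[:3] == ["git", "show", "--format="]
    assert len(files) == 1
```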

Scenario 2: Added files extracted correctly from unified diff
  Given a commit that adds a new file with 100 lines
  When parsing the unified diff output
  Then diff_type is "added"
  And diff_content contains all 100 lines prefixed with "+"
  And blob_hash is extracted from "index" line
  And file marked as new file mode

Scenario 3: Deleted files extracted correctly from unified diff
  Given a commit that deletes a file with 50 lines
  When parsing the unified diff output
  Then diff_type is "deleted"
  And diff_content contains all 50 lines prefixed with "-"
  And blob_hash is extracted from old blob in "index" line
  And parent_commit_hash is populated
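The blob-hash rule in scenarios 2 and 3 (new blob for added/modified files, old blob for deletions, whose new blob is all zeros) can be expressed as a small helper. `parse_index_line` is an illustrative name, not necessarily the one the implementation uses:

```python
import re

INDEX_RE = re.compile(r"^index ([0-9a-f]+)\.\.([0-9a-f]+)")

def parse_index_line(line):
    # "index abc1234..def5678 100644" -> ("abc1234", "def5678")
    m = INDEX_RE.match(line)
    return (m.group(1), m.group(2)) if m else (None, None)

def blob_hash_for(diff_type, old_blob, new_blob):
    # Deleted files have an all-zero new blob, so fall back to the old one
    return old_blob if diff_type == "deleted" else new_blob
```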

Scenario 4: Modified files show only changes
  Given a commit that modifies 20 lines in a 500-line file
  When parsing the unified diff output
  Then diff_type is "modified"
  And diff_content contains only the changed lines with @@ markers
  And diff_content does NOT contain all 500 lines
  And blob_hash is extracted from new blob in "index" line

Scenario 5: Binary files detected from diff output
  Given a commit with binary files (images, PDFs)
  When parsing the unified diff output containing "Binary files differ"
  Then diff_type is "binary"
  And diff_content is metadata only (not binary data)
  And file is marked as binary

Scenario 6: Renamed files without content changes
  Given a commit that renames a file without modifying it
  When parsing the unified diff output with "rename from/to"
  Then diff_type is "renamed"
  And old_path and new_path are both populated
  And diff_content shows rename metadata

Scenario 7: Override filtering still works
  Given a commit touching files in excluded directory (help/)
  When get_diffs_for_commit() is called
  Then excluded files are filtered out
  And only non-excluded files appear in diffs list
  And git call still executes only once

Scenario 8: Performance improvement measurable
  Given temporal indexing on Evolution codebase (82K files)
  When indexing 100 commits
  Then git operation time reduces from ~330ms to ~33ms per commit
  And overall throughput improves from 4.5 to 10-12 files/s
  And total indexing time reduces by 30-40%

Manual Testing Strategy

Manual Testing Approach:

  1. Baseline Measurement (Before optimization):

    • Run: cidx index --index-commits --clear on test repository
    • Monitor with: ps -L -p <PID> -o lwp,pcpu,state,wchan:30
    • Capture: Git subprocess count, throughput (files/s), total time
  2. Implementation Testing (After optimization):

    • Run: cidx index --index-commits --clear on same repository
    • Monitor same metrics
    • Verify: Single git call per commit (using process monitoring or logs)
  3. Functional Verification:

    • Run: cidx query "authentication" --temporal
    • Verify: Results include added, modified, and deleted files
    • Verify: Results match pre-optimization behavior (same content indexed)
  4. Large Repository Validation:

    • Test on Evolution codebase (82,742 files)
    • Measure: Time to index 1000 commits
    • Expected: 30-40% reduction in indexing time
  5. Edge Case Testing:

    • Test commit with only binary files
    • Test commit with only renamed files
    • Test commit with excluded files (help/ directory)
    • Verify: All handled correctly without errors

Evidence to Capture:

  • Before/after git subprocess counts (strace or process monitoring)
  • Before/after throughput metrics (files/s, KB/s)
  • Before/after total indexing time for same commit range
  • Query results proving search functionality intact
  • Screenshots or logs of progress display showing improved speed
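For the subprocess-count evidence item, one approach is to capture an strace log (e.g. `strace -f -e trace=execve -o trace.log cidx index --index-commits`) and count git spawns in it. The helper and sample log line below are illustrative sketches of that counting step:

```python
import re

GIT_EXEC_RE = re.compile(r'execve\("[^"]*/git"')

def count_git_execs(trace_text):
    # Count execve lines that launch a git binary in an "strace -f" log
    return sum(1 for line in trace_text.splitlines()
               if GIT_EXEC_RE.search(line))
```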

Testing Requirements

  • Unit tests covering all diff parsing logic paths (added/deleted/modified/renamed/binary)
  • Integration tests for git call reduction (12 calls → 1 call)
  • Performance tests measuring git overhead reduction (330ms → 33ms)
  • E2E tests validating complete temporal indexing workflow
  • Manual testing via running cidx on Evolution codebase and measuring performance

Definition of Done

  • ✅ All acceptance criteria satisfied
  • ✅ >90% unit test coverage achieved
  • ✅ Integration tests passing
  • ✅ E2E tests with zero mocking passing
  • ✅ Code review approved
  • ✅ Manual end-to-end testing completed by Claude Code
  • ✅ No lint/type errors
  • ✅ Working software deployable to users
  • ✅ Performance improvement validated (2-3x faster measured)
