[STORY] Optimized Commit Retrieval - Single Git Call Per Commit #471

@jsbattig

Description

User Story: Optimized Commit Retrieval

As a developer using temporal git history indexing
I want commit data retrieval to use single batched git operations
So that indexing completes 2-3x faster, reducing wait time from 50 minutes to 15-20 minutes

Story Context

Current Performance:

  • 10-12 git subprocess calls per commit (330ms git overhead)
  • Throughput: 4.5 files/s, 35 KB/s
  • Large repo (82K files): 50+ minutes to index

Target Performance:

  • 1 git call per commit using git show --format="" commit (33ms overhead)
  • Throughput: 10-12 files/s, 70-90 KB/s
  • Large repo: 15-20 minutes to index

Impact:

  • 10x reduction in git operations
  • 40% overall speedup
  • 2-3x faster temporal indexing

Implementation Components

  • Single git call implementation in TemporalDiffScanner
  • Unified diff output parser
  • File type detection from diff headers
  • Blob hash extraction from index lines
  • Binary file detection from diff output
  • Override filtering integration preserved
  • Unit tests (0/0 passing)
  • Integration tests (0/0 passing)
  • Performance tests (0/0 passing)
  • E2E manual testing completed by Claude Code

Completion: 0/10 tasks complete (0%)
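The first component, the single git call, could be sketched as below. This is a minimal sketch, not the actual TemporalDiffScanner code; the `--no-color` flag is an assumption added here to keep the output machine-parseable, and the helper names are illustrative.

```python
import subprocess

def git_show_command(commit_hash):
    # --format= suppresses the commit-message header so only the diff remains
    return ["git", "show", "--format=", "--no-color", commit_hash]

def get_commit_diff_output(commit_hash, repo_path="."):
    # One subprocess call per commit replaces the previous 10-12 calls
    result = subprocess.run(
        git_show_command(commit_hash),
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return result.stdout
```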

Algorithm

TemporalDiffScanner:
  Data structures:
    - diffs: List[DiffInfo]  # Output collection
    - current_file: FileInfo  # Current file being parsed
    - override_filter_service: OverrideFilterService

  get_diffs_for_commit(commit_hash):
    # PHASE 1: Single git call to get all changes
    output = subprocess.run(["git", "show", "--format=", commit_hash], capture_output=True, text=True).stdout
    
    # PHASE 2: Parse unified diff output
    diffs = []
    current_file = None
    diff_content_lines = []
    
    FOR each line IN output.split("\n"):
      IF line starts with "diff --git":
        # New file starting - save previous file if exists
        IF current_file is not None:
          IF should_include_file(current_file.path):
            diff = create_diff_info(current_file, diff_content_lines)
            diffs.append(diff)
        
        # Parse file paths from: diff --git a/path b/path
        current_file = parse_diff_header(line)
        diff_content_lines = []
      
      ELSE IF line starts with "new file mode":
        current_file.type = "added"
      
      ELSE IF line starts with "deleted file mode":
        current_file.type = "deleted"
      
      ELSE IF line starts with "rename from":
        current_file.type = "renamed"
        current_file.old_path = extract_path(line)
      
      ELSE IF line starts with "index ":
        # Extract blob hashes: index abc123..def456
        current_file.old_blob, current_file.new_blob = parse_index_line(line)
      
      ELSE IF line starts with "Binary files":
        current_file.is_binary = True
      
      ELSE IF line starts with ("---", "+++", "@@", "+", "-", " "):
        # Diff content - accumulate
        diff_content_lines.append(line)
    
    # Don't forget last file
    IF current_file is not None AND should_include_file(current_file.path):
      diff = create_diff_info(current_file, diff_content_lines)
      diffs.append(diff)
    
    RETURN diffs

  create_diff_info(file_info, diff_lines):
    # Determine diff type
    IF file_info.is_binary:
      diff_type = "binary"
      diff_content = f"Binary file {file_info.type}: {file_info.path}"
    
    ELSE IF file_info.type == "renamed" AND no content changes:
      diff_type = "renamed"
      diff_content = f"File renamed from {file_info.old_path} to {file_info.path}"
    
    ELSE:
      diff_type = file_info.type  # "added", "deleted", "modified"
      diff_content = "\n".join(diff_lines)
    
    RETURN DiffInfo(
      file_path = file_info.path,
      diff_type = diff_type,
      diff_content = diff_content,
      blob_hash = file_info.new_blob OR file_info.old_blob,
      old_path = file_info.old_path,
      parent_commit_hash = get_parent_if_needed(file_info)
    )

Key Algorithm Features:

  1. Single git call gets all changes in unified diff format
  2. State machine parsing (track current file, accumulate diff lines)
  3. Header detection determines file type (added/deleted/modified/renamed/binary)
  4. Blob hash extraction from "index" line
  5. Override filtering applied per file
  6. Preserves all existing metadata and deduplication logic
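The state machine above can be fleshed out as a self-contained sketch. Names such as `DiffInfo` and `parse_unified_diff` are illustrative rather than the real scanner's API, and the greedy header regex would need refinement for paths containing spaces.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DiffInfo:
    file_path: str
    diff_type: str = "modified"
    old_path: Optional[str] = None
    old_blob: Optional[str] = None
    new_blob: Optional[str] = None
    is_binary: bool = False
    content_lines: List[str] = field(default_factory=list)

HEADER_RE = re.compile(r"^diff --git a/(.+) b/(.+)$")  # naive for paths with spaces
INDEX_RE = re.compile(r"^index ([0-9a-f]+)\.\.([0-9a-f]+)")

def parse_unified_diff(output: str) -> List[DiffInfo]:
    diffs: List[DiffInfo] = []
    current: Optional[DiffInfo] = None
    for line in output.split("\n"):
        header = HEADER_RE.match(line)
        if header:
            if current is not None:
                diffs.append(current)           # flush the previous file
            current = DiffInfo(file_path=header.group(2))
        elif current is None:
            continue                            # preamble before the first header
        elif line.startswith("new file mode"):
            current.diff_type = "added"
        elif line.startswith("deleted file mode"):
            current.diff_type = "deleted"
        elif line.startswith("rename from "):
            current.diff_type = "renamed"
            current.old_path = line[len("rename from "):]
        elif line.startswith("index "):
            m = INDEX_RE.match(line)            # index <old>..<new> [mode]
            if m:
                current.old_blob, current.new_blob = m.group(1), m.group(2)
        elif line.startswith("Binary files"):
            current.is_binary = True
        elif line.startswith(("---", "+++", "@@", "+", "-", " ")):
            current.content_lines.append(line)  # accumulate diff content
    if current is not None:
        diffs.append(current)                   # don't forget the last file
    return diffs
```

Override filtering and `create_diff_info` would plug in at the two flush points, exactly as in the pseudocode.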

Acceptance Criteria

Scenario 1: Single git call retrieves all commit changes
  Given a commit with 10 files (4 added, 3 modified, 2 deleted, 1 renamed)
  When get_diffs_for_commit() is called
  Then exactly 1 git subprocess call is made
  And the call is "git show --format= <commit_hash>"
  And all 10 files are returned in diffs list
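Scenario 1 can be checked without a real repository by mocking the subprocess layer. The `get_diffs_for_commit` below is a hypothetical stand-in for the scanner method (it only counts file headers), patched at the module level:

```python
import subprocess
from unittest.mock import patch

SAMPLE_DIFF = (
    "diff --git a/foo.py b/foo.py\n"
    "index 1111111..2222222 100644\n"
    "--- a/foo.py\n"
    "+++ b/foo.py\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
)

def get_diffs_for_commit(commit_hash):
    # Stand-in for TemporalDiffScanner.get_diffs_for_commit: one git call, then parse
    out = subprocess.run(
        ["git", "show", "--format=", commit_hash],
        capture_output=True, text=True,
    ).stdout
    return [l for l in out.splitlines() if l.startswith("diff --git")]

def test_single_git_call():
    with patch("subprocess.run") as mock_run:
        mock_run.return_value.stdout = SAMPLE_DIFF
        files = get_diffs_for_commit("abc123")
    assert mock_run.call_count == 1                 # exactly one git subprocess call
    argv = mock_run.call_args.args[0]
    assert argv[:3] == ["git", "show", "--format="]
    assert len(files) == 1
```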

Scenario 2: Added files extracted correctly from unified diff
  Given a commit that adds a new file with 100 lines
  When parsing the unified diff output
  Then diff_type is "added"
  And diff_content contains all 100 lines prefixed with "+"
  And blob_hash is extracted from "index" line
  And file marked as new file mode

Scenario 3: Deleted files extracted correctly from unified diff
  Given a commit that deletes a file with 50 lines
  When parsing the unified diff output
  Then diff_type is "deleted"
  And diff_content contains all 50 lines prefixed with "-"
  And blob_hash is extracted from old blob in "index" line
  And parent_commit_hash is populated
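The blob-hash rule in scenarios 2 and 3 (new blob for added/modified files, old blob for deletions, whose new blob is all zeros) can be expressed as a small helper. `parse_index_line` is an illustrative name, not necessarily the one the implementation uses:

```python
import re

INDEX_RE = re.compile(r"^index ([0-9a-f]+)\.\.([0-9a-f]+)")

def parse_index_line(line):
    # "index abc1234..def5678 100644" -> ("abc1234", "def5678")
    m = INDEX_RE.match(line)
    return (m.group(1), m.group(2)) if m else (None, None)

def blob_hash_for(diff_type, old_blob, new_blob):
    # Deleted files have an all-zero new blob, so fall back to the old one
    return old_blob if diff_type == "deleted" else new_blob
```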

Scenario 4: Modified files show only changes
  Given a commit that modifies 20 lines in a 500-line file
  When parsing the unified diff output
  Then diff_type is "modified"
  And diff_content contains only the changed lines with @@ markers
  And diff_content does NOT contain all 500 lines
  And blob_hash is extracted from new blob in "index" line

Scenario 5: Binary files detected from diff output
  Given a commit with binary files (images, PDFs)
  When parsing the unified diff output containing "Binary files differ"
  Then diff_type is "binary"
  And diff_content is metadata only (not binary data)
  And file is marked as binary

Scenario 6: Renamed files without content changes
  Given a commit that renames a file without modifying it
  When parsing the unified diff output with "rename from/to"
  Then diff_type is "renamed"
  And old_path and new_path are both populated
  And diff_content shows rename metadata

Scenario 7: Override filtering still works
  Given a commit touching files in excluded directory (help/)
  When get_diffs_for_commit() is called
  Then excluded files are filtered out
  And only non-excluded files appear in diffs list
  And git call still executes only once

Scenario 8: Performance improvement measurable
  Given temporal indexing on Evolution codebase (82K files)
  When indexing 100 commits
  Then git operation time reduces from ~330ms to ~33ms per commit
  And overall throughput improves from 4.5 to 10-12 files/s
  And total indexing time reduces by 30-40%

Manual Testing Strategy

Manual Testing Approach:

  1. Baseline Measurement (Before optimization):

    • Run: cidx index --index-commits --clear on test repository
    • Monitor with: ps -L -p <PID> -o lwp,pcpu,state,wchan:30
    • Capture: Git subprocess count, throughput (files/s), total time
  2. Implementation Testing (After optimization):

    • Run: cidx index --index-commits --clear on same repository
    • Monitor same metrics
    • Verify: Single git call per commit (using process monitoring or logs)
  3. Functional Verification:

    • Run: cidx query "authentication" --temporal
    • Verify: Results include added, modified, and deleted files
    • Verify: Results match pre-optimization behavior (same content indexed)
  4. Large Repository Validation:

    • Test on Evolution codebase (82,742 files)
    • Measure: Time to index 1000 commits
    • Expected: 30-40% reduction in indexing time
  5. Edge Case Testing:

    • Test commit with only binary files
    • Test commit with only renamed files
    • Test commit with excluded files (help/ directory)
    • Verify: All handled correctly without errors

Evidence to Capture:

  • Before/after git subprocess counts (strace or process monitoring)
  • Before/after throughput metrics (files/s, KB/s)
  • Before/after total indexing time for same commit range
  • Query results proving search functionality intact
  • Screenshots or logs of progress display showing improved speed
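For the subprocess-count evidence item, one approach is to capture an strace log (e.g. `strace -f -e trace=execve -o trace.log cidx index --index-commits`) and count git spawns in it. The helper and sample log line below are illustrative sketches of that counting step:

```python
import re

GIT_EXEC_RE = re.compile(r'execve\("[^"]*/git"')

def count_git_execs(trace_text):
    # Count execve lines that launch a git binary in an "strace -f" log
    return sum(1 for line in trace_text.splitlines()
               if GIT_EXEC_RE.search(line))
```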

Testing Requirements

  • Unit tests covering all diff parsing logic paths (added/deleted/modified/renamed/binary)
  • Integration tests for git call reduction (12 calls → 1 call)
  • Performance tests measuring git overhead reduction (330ms → 33ms)
  • E2E tests validating complete temporal indexing workflow
  • Manual testing via running cidx on Evolution codebase and measuring performance

Definition of Done

  • ✅ All acceptance criteria satisfied
  • ✅ >90% unit test coverage achieved
  • ✅ Integration tests passing
  • ✅ E2E tests with zero mocking passing
  • ✅ Code review approved
  • ✅ Manual end-to-end testing completed by Claude Code
  • ✅ No lint/type errors
  • ✅ Working software deployable to users
  • ✅ Performance improvement validated (2-3x faster measured)
