User Story: Optimized Commit Retrieval
As a developer using temporal git history indexing
I want commit data retrieval to use single batched git operations
So that indexing completes 2-3x faster, reducing wait time from 50 minutes to 15-20 minutes
Story Context
Current Performance:
- 10-12 git subprocess calls per commit (330ms git overhead)
- Throughput: 4.5 files/s, 35 KB/s
- Large repo (82K files): 50+ minutes to index
Target Performance:
- 1 git call per commit using git show --format="" <commit> (33ms overhead)
- Throughput: 10-12 files/s, 70-90 KB/s
- Large repo: 15-20 minutes to index
Impact:
- 10x reduction in git operations
- 40% overall speedup
- 2-3x faster temporal indexing
Implementation Components
- Single git call implementation in TemporalDiffScanner
- Unified diff output parser
- File type detection from diff headers
- Blob hash extraction from index lines
- Binary file detection from diff output
- Override filtering integration preserved
- Unit tests (0/0 passing)
- Integration tests (0/0 passing)
- Performance tests (0/0 passing)
- E2E manual testing completed by Claude Code
Completion: 0/10 tasks complete (0%)
Algorithm
TemporalDiffScanner:
Data structures:
- diffs: List[DiffInfo] # Output collection
- current_file: FileInfo # Current file being parsed
- override_filter_service: OverrideFilterService
get_diffs_for_commit(commit_hash):
# PHASE 1: Single git call to get all changes
output = subprocess.run(["git", "show", "--format=", commit_hash], capture_output=True, text=True, check=True).stdout
# PHASE 2: Parse unified diff output
diffs = []
current_file = None
diff_content_lines = []
FOR each line IN output.split("\n"):
IF line starts with "diff --git":
# New file starting - save previous file if exists
IF current_file is not None:
IF should_include_file(current_file.path):
diff = create_diff_info(current_file, diff_content_lines)
diffs.append(diff)
# Parse file paths from: diff --git a/path b/path
current_file = parse_diff_header(line)
diff_content_lines = []
ELSE IF line starts with "new file mode":
current_file.type = "added"
ELSE IF line starts with "deleted file mode":
current_file.type = "deleted"
ELSE IF line starts with "rename from":
current_file.type = "renamed"
current_file.old_path = extract_path(line)
ELSE IF line starts with "index ":
# Extract blob hashes: index abc123..def456
current_file.old_blob, current_file.new_blob = parse_index_line(line)
ELSE IF line starts with "Binary files":
current_file.is_binary = True
ELSE IF line starts with ("---", "+++", "@@", "+", "-", " "):
# Diff content - accumulate
diff_content_lines.append(line)
# Don't forget last file
IF current_file is not None AND should_include_file(current_file.path):
diff = create_diff_info(current_file, diff_content_lines)
diffs.append(diff)
RETURN diffs
create_diff_info(file_info, diff_lines):
# Determine diff type
IF file_info.is_binary:
diff_type = "binary"
diff_content = f"Binary file {file_info.type}: {file_info.path}"
ELSE IF file_info.type == "renamed" AND no content changes:
diff_type = "renamed"
diff_content = f"File renamed from {file_info.old_path} to {file_info.path}"
ELSE:
diff_type = file_info.type # "added", "deleted", "modified"
diff_content = "\n".join(diff_lines)
RETURN DiffInfo(
file_path = file_info.path,
diff_type = diff_type,
diff_content = diff_content,
blob_hash = file_info.new_blob OR file_info.old_blob,
old_path = file_info.old_path,
parent_commit_hash = get_parent_if_needed(file_info)
)
Key Algorithm Features:
- Single git call gets all changes in unified diff format
- State machine parsing (track current file, accumulate diff lines)
- Header detection determines file type (added/deleted/modified/renamed/binary)
- Blob hash extraction from "index" line
- Override filtering applied per file
- Preserves all existing metadata and deduplication logic
Acceptance Criteria
Scenario 1: Single git call retrieves all commit changes
Given a commit with 10 files (4 added, 3 modified, 2 deleted, 1 renamed)
When get_diffs_for_commit() is called
Then exactly 1 git subprocess call is made
And the call is "git show --format= <commit_hash>"
And all 10 files are returned in diffs list
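One way Scenario 1 could be unit-tested is by mocking `subprocess.run` and counting calls. The `get_diffs_for_commit` below is a simplified hypothetical stand-in for the scanner method, reduced to just the single batched git call:

```python
import subprocess
from unittest.mock import patch, MagicMock

def get_diffs_for_commit(commit_hash: str) -> str:
    # Simplified stand-in: one `git show --format=` call returns
    # every change in the commit as unified diff text.
    result = subprocess.run(
        ["git", "show", "--format=", commit_hash],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

with patch("subprocess.run") as mock_run:
    mock_run.return_value = MagicMock(stdout="diff --git a/f b/f\n")
    get_diffs_for_commit("abc1234")
    # Scenario 1: exactly one git subprocess call, with the expected argv.
    assert mock_run.call_count == 1
    assert mock_run.call_args.args[0][:3] == ["git", "show", "--format="]
```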
Scenario 2: Added files extracted correctly from unified diff
Given a commit that adds a new file with 100 lines
When parsing the unified diff output
Then diff_type is "added"
And diff_content contains all 100 lines prefixed with "+"
And blob_hash is extracted from "index" line
And file marked as new file mode
Scenario 3: Deleted files extracted correctly from unified diff
Given a commit that deletes a file with 50 lines
When parsing the unified diff output
Then diff_type is "deleted"
And diff_content contains all 50 lines prefixed with "-"
And blob_hash is extracted from old blob in "index" line
And parent_commit_hash is populated
Scenario 4: Modified files show only changes
Given a commit that modifies 20 lines in a 500-line file
When parsing the unified diff output
Then diff_type is "modified"
And diff_content contains only the changed lines with @@ markers
And diff_content does NOT contain all 500 lines
And blob_hash is extracted from new blob in "index" line
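The blob-hash extraction shared by Scenarios 2-4 can be sketched as a small helper; `parse_index_line` here is an illustrative name, not a confirmed project API. An all-zero hash on one side of the `..` means that side has no blob (the file was added or deleted):

```python
import re

def parse_index_line(line: str) -> tuple[str, str]:
    """Split 'index <old>..<new>[ <mode>]' into (old_blob, new_blob)."""
    m = re.match(r"index ([0-9a-f]+)\.\.([0-9a-f]+)", line)
    if not m:
        raise ValueError(f"not an index line: {line!r}")
    return m.group(1), m.group(2)

old, new = parse_index_line("index 0000000..4b825dc 100644")
# Deleted files have an all-zero new blob, so fall back to the old side;
# added files (as here) use the new side.
blob_hash = new if set(new) != {"0"} else old
```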
Scenario 5: Binary files detected from diff output
Given a commit with binary files (images, PDFs)
When parsing the unified diff output containing "Binary files differ"
Then diff_type is "binary"
And diff_content is metadata only (not binary data)
And file is marked as binary
Scenario 6: Renamed files without content changes
Given a commit that renames a file without modifying it
When parsing the unified diff output with "rename from/to"
Then diff_type is "renamed"
And old_path and new_path are both populated
And diff_content shows rename metadata
Scenario 7: Override filtering still works
Given a commit touching files in excluded directory (help/)
When get_diffs_for_commit() is called
Then excluded files are filtered out
And only non-excluded files appear in diffs list
And git call still executes only once
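Because filtering happens after the single git call, it stays a pure in-memory pass over parsed paths. A minimal sketch, assuming a simple prefix-based exclusion list stands in for the actual `OverrideFilterService` rules:

```python
# Hypothetical exclusion config; the real service may use richer patterns.
EXCLUDED_DIRS = ("help/",)

def should_include_file(path: str) -> bool:
    """Applied per file after parsing, before creating DiffInfo."""
    return not path.startswith(EXCLUDED_DIRS)

paths = ["src/auth.py", "help/guide.md", "README.md"]
kept = [p for p in paths if should_include_file(p)]
# Only non-excluded files survive; no extra git calls are needed.
```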
Scenario 8: Performance improvement measurable
Given temporal indexing on Evolution codebase (82K files)
When indexing 100 commits
Then git operation time reduces from ~330ms to ~33ms per commit
And overall throughput improves from 4.5 to 10-12 files/s
And total indexing time reduces by 30-40%
Manual Testing Strategy
Manual Testing Approach:
Baseline Measurement (Before optimization):
- Run: cidx index --index-commits --clear on test repository
- Monitor with: ps -L -p <PID> -o lwp,pcpu,state,wchan:30
- Capture: Git subprocess count, throughput (files/s), total time
Implementation Testing (After optimization):
- Run: cidx index --index-commits --clear on same repository
- Monitor same metrics
- Verify: Single git call per commit (using process monitoring or logs)
Functional Verification:
- Run: cidx query "authentication" --temporal
- Verify: Results include added, modified, and deleted files
- Verify: Results match pre-optimization behavior (same content indexed)
Large Repository Validation:
- Test on Evolution codebase (82,742 files)
- Measure: Time to index 1000 commits
- Expected: 30-40% reduction in indexing time
Edge Case Testing:
- Test commit with only binary files
- Test commit with only renamed files
- Test commit with excluded files (help/ directory)
- Verify: All handled correctly without errors
Evidence to Capture:
- Before/after git subprocess counts (strace or process monitoring)
- Before/after throughput metrics (files/s, KB/s)
- Before/after total indexing time for same commit range
- Query results proving search functionality intact
- Screenshots or logs of progress display showing improved speed
Testing Requirements
- Unit tests covering all diff parsing logic paths (added/deleted/modified/renamed/binary)
- Integration tests for git call reduction (12 calls → 1 call)
- Performance tests measuring git overhead reduction (330ms → 33ms)
- E2E tests validating complete temporal indexing workflow
- Manual testing via running cidx on Evolution codebase and measuring performance
Definition of Done
- ✅ All acceptance criteria satisfied
- ✅ >90% unit test coverage achieved
- ✅ Integration tests passing
- ✅ E2E tests with zero mocking passing
- ✅ Code review approved
- ✅ Manual end-to-end testing completed by Claude Code
- ✅ No lint/type errors
- ✅ Working software deployable to users
- ✅ Performance improvement validated (2-3x faster measured)