[STORY] Git History Indexing with Blob Deduplication and Branch Metadata #460

@jsbattig

Description


Part of: #274

Note: Story content truncated due to GitHub issue size limit. See file for full details.

Story: Git History Indexing with Blob Deduplication and Branch Metadata

Story Description

As an AI coding agent
I want to index a repository's git history with storage deduplication and branch awareness
So that I can search across historical code cost-effectively while preserving branch context

Conversation Context:

  • User specified need for semantic search across git history to find removed code
  • Emphasized storage efficiency via git blob deduplication (92%+ savings achieved)
  • Required to handle 40K+ commit repositories efficiently
  • Analysis of Evolution repo (1,135 branches, 89K commits) showed single-branch indexing covers 91.6% of commits
  • Default to current branch only, with all-branches indexing as an opt-in, to avoid an ~85% storage increase

⚠️ CRITICAL IMPLEMENTATION INSTRUCTION

STOP AFTER COMPLETION: When asked to "start implementing" the Temporal Epic, implement ONLY Story 1 (this story) and then STOP. Do not proceed to Story 2 or subsequent stories without explicit approval.

Implementation Checkpoint Workflow:

  1. Implement Story 1 completely (TDD workflow with all tests passing)
  2. Run code review and manual testing
  3. Commit changes
  4. STOP and wait for user review/approval
  5. Only proceed to Story 2 after user explicitly approves

Rationale: This is the foundational story establishing temporal indexing architecture. User must review and validate the implementation approach before building dependent features on top of it.

Acceptance Criteria

Core Functionality (Both Modes)

  • Running cidx index --index-commits indexes current branch only (default behavior)
  • Running cidx index --index-commits --all-branches indexes all branches
  • Creates SQLite database at .code-indexer/index/temporal/commits.db with the commit graph (a schema sketch follows this list)
  • Creates commit_branches table tracking which branches each commit appears in
  • Builds blob registry at .code-indexer/index/temporal/blob_registry.db (SQLite) mapping blob_hash → point_ids
  • Reuses existing vectors for blobs already in HEAD (deduplication)
  • Only embeds new blobs not present in current HEAD
  • Stores temporal metadata including branch information in .code-indexer/index/temporal/temporal_meta.json
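
The databases above are specified only by their paths; below is a minimal sqlite3 sketch of what commits.db and blob_registry.db could contain. All table and column names here are assumptions, not a final schema.

import sqlite3
from pathlib import Path

temporal_dir = Path(".code-indexer/index/temporal")
temporal_dir.mkdir(parents=True, exist_ok=True)

# Hypothetical layout for commits.db: commit graph plus branch membership
commits_db = sqlite3.connect(temporal_dir / "commits.db")
commits_db.executescript("""
    CREATE TABLE IF NOT EXISTS commits (
        commit_hash  TEXT PRIMARY KEY,
        author       TEXT,
        committed_at TEXT
    );
    -- links each commit to the blobs in its tree (new and reused vectors alike)
    CREATE TABLE IF NOT EXISTS commit_blobs (
        commit_hash TEXT,
        blob_hash   TEXT,
        file_path   TEXT
    );
    -- tracks which branches each commit appears in
    CREATE TABLE IF NOT EXISTS commit_branches (
        commit_hash TEXT,
        branch      TEXT
    );
""")

# Hypothetical layout for blob_registry.db: blob_hash -> point_ids
registry_db = sqlite3.connect(temporal_dir / "blob_registry.db")
registry_db.execute(
    "CREATE TABLE IF NOT EXISTS blob_registry (blob_hash TEXT PRIMARY KEY, point_ids TEXT)"
)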

Daemon Mode Functionality

  • Temporal indexing works when daemon.enabled: true in config
  • CLI automatically delegates to daemon via _index_via_daemon(index_commits=True)
  • Daemon's exposed_index_blocking() handles temporal indexing via TemporalIndexer
  • Progress callbacks stream from daemon to CLI in real-time
  • Cache invalidated before temporal indexing starts (daemon mode)
  • Graceful fallback to standalone mode if the daemon is unavailable (see the delegation sketch after this list)
  • All flags (--all-branches, --max-commits, --since-date) passed through delegation
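
A hedged sketch of that delegation path follows; _index_via_daemon() and exposed_index_blocking() come from this story, while the wrapper function, its signature, and the exception caught for fallback are assumptions about the existing CLI internals.

def _run_temporal_index(config, config_manager, vector_store,
                        all_branches, max_commits, since_date):
    """Delegate temporal indexing to the daemon when enabled, else run standalone (sketch)."""
    if config.daemon.enabled:
        try:
            # Daemon path: exposed_index_blocking() runs TemporalIndexer inside the
            # daemon and streams progress callbacks back to the CLI in real time.
            return _index_via_daemon(
                index_commits=True,
                all_branches=all_branches,
                max_commits=max_commits,
                since_date=since_date,
            )
        except ConnectionError:
            # Graceful fallback to standalone mode when the daemon is unavailable
            pass

    # Standalone path (same entry point used when daemon.enabled is false)
    from src.code_indexer.services.temporal_indexer import TemporalIndexer
    return TemporalIndexer(config_manager, vector_store).index_commits(
        all_branches=all_branches,
        max_commits=max_commits,
        since_date=since_date,
    )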

User Experience

  • Shows progress during indexing: "Indexing commits: 500/5000 (10%) [development branch]"
  • Displays cost warning before indexing all branches: "⚠️ Indexing 715 branches will use ~514MB storage and cost ~$4.74"
  • Requires user confirmation (y/N) for --all-branches in large repos (>50 branches)
  • Shows final statistics: branches indexed, commits per branch, deduplication ratio

Performance

  • Achieves >92% storage savings through blob deduplication
  • Handles large repositories (40K+ commits, 1000+ branches) without running out of memory
  • Single-branch indexing is fast (similar to current indexing performance)

Technical Architecture (CRITICAL - Read First)

What We Index: Full Blob Versions, NOT Diffs

CRITICAL DECISION: We index complete file versions (blobs) at each point in git history, NOT diffs.

✅ CORRECT: Index full blob content
Commit abc123: user.py (blob def456)
→ Index complete file with full class User, all methods, imports, context

❌ WRONG: Index diffs
Commit abc123: user.py diff
→ Would only index: "+ def greet(): ..." (no context, can't do semantic search)

Rationale:

  1. Semantic search requires full context - a query like "authentication with JWT" needs the complete class implementation
  2. Users want complete implementations - "Find removed auth code" must return full working code, not fragments
  3. Git stores blobs efficiently - Git handles compression/deduplication internally
  4. Better deduplication - Same file content = same blob hash = reuse existing vectors

Example Query: "function that handles JWT authentication"

  • Full blob approach: Finds complete AuthManager class with all context ✅
  • Diff approach: Only finds "+ jwt.decode()" line without context ❌

Component Architecture and Reuse Strategy

Pipeline Reuse (85% of workspace indexing code):

Workspace Indexing (Current HEAD):
  Disk Files → FileIdentifier → FixedSizeChunker.chunk_file()
    → VectorCalculationManager → FilesystemVectorStore

Git History Indexing (This Story):
  Git Blobs → GitBlobReader → FixedSizeChunker.chunk_text()
    → VectorCalculationManager → FilesystemVectorStore
           ↑                         ↑
        SAME COMPONENTS REUSED (85%)

✅ Reused Components (No Changes):

  • VectorCalculationManager - Parallel embedding generation (VoyageAI API)
  • FilesystemVectorStore - JSON vector storage (already has blob_hash field)
  • Threading patterns (ThreadPoolExecutor, CleanSlotTracker)
  • Progress callback mechanism

⚙️ Modified Components (Minor Changes):

  • FixedSizeChunker - Add chunk_text(text, source) method for pre-loaded text
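
A minimal sketch of the added chunk_text() method, assuming the existing chunker works on fixed-size character windows; the chunk size, overlap, TextChunk shape, and chunk_file() delegation shown here are assumptions about the real class.

from dataclasses import dataclass
from pathlib import Path
from typing import List

@dataclass
class TextChunk:
    text: str
    source: str       # file path for disk files, or the repo path of a git blob
    chunk_index: int

class FixedSizeChunker:
    CHUNK_SIZE = 1000  # assumed defaults; the real class defines its own
    OVERLAP = 150

    def chunk_text(self, text: str, source: str) -> List[TextChunk]:
        """Chunk pre-loaded text (e.g. a git blob) without touching the filesystem."""
        chunks: List[TextChunk] = []
        step = self.CHUNK_SIZE - self.OVERLAP
        for i, start in enumerate(range(0, max(len(text), 1), step)):
            chunks.append(TextChunk(text=text[start:start + self.CHUNK_SIZE],
                                    source=source, chunk_index=i))
        return chunks

    def chunk_file(self, path: Path) -> List[TextChunk]:
        """Existing disk-based entry point can now delegate to chunk_text()."""
        return self.chunk_text(path.read_text(encoding="utf-8", errors="replace"), str(path))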

🆕 New Git-Specific Components:

  1. TemporalBlobScanner - Discovers blobs in git history

    def get_blobs_for_commit(commit_hash: str) -> List[BlobInfo]:
        """Uses: git ls-tree -r <commit_hash>"""
        # Returns: [(file_path, blob_hash, size), ...]
  2. GitBlobReader - Reads blob content from git object store

    def read_blob_content(blob_hash: str) -> str:
        """Uses: git cat-file blob <blob_hash>"""
        # Returns: Full file content as string
  3. HistoricalBlobProcessor - Parallel blob processing (analogous to HighThroughputProcessor)

    def process_blobs_high_throughput(blobs: List[BlobInfo]) -> Stats:
        """
        Orchestrates: blob → read → chunk → vector → store
        Reuses: VectorCalculationManager + FilesystemVectorStore
        """

Deduplication Flow (92% Vector Reuse)

Key Insight: Most blobs across history already have vectors from HEAD indexing.

# For each commit (e.g., 150 blobs per commit)
for commit in commits:
    # Step 1: Get all blobs in commit
    all_blobs = scanner.get_blobs_for_commit(commit.hash)  # 150 blobs

    # Step 2: Check blob registry (SQLite lookup, microseconds)
    new_blobs = []
    for blob in all_blobs:
        if not blob_registry.has_blob(blob.blob_hash):  # Fast SQLite query
            new_blobs.append(blob)  # Only ~12 blobs are new (92% dedup)

    # Step 3: Process ONLY new blobs (skip 138 existing)
    if new_blobs:
        # Use HistoricalBlobProcessor (reuses VectorCalculationManager)
        blob_processor.process_blobs_high_throughput(
            new_blobs,  # Only 12 blobs instead of 150
            vector_thread_count=8
        )

    # Step 4: Link commit → all blobs (new + reused) in SQLite
    store_commit_metadata(commit, all_blobs)

Result: 150 blobs → 12 embeddings → 92% savings
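
BlobRegistry backs the has_blob() checks above and the build_from_vector_store() call later in this story, but is never defined. A minimal SQLite-backed sketch; add_blob(), the iter_points() iteration, and the column layout are assumptions.

import json
import sqlite3
from pathlib import Path
from typing import List

class BlobRegistry:
    """Maps git blob_hash -> vector point_ids so historical blobs reuse existing vectors."""

    def __init__(self, db_path: Path):
        db_path.parent.mkdir(parents=True, exist_ok=True)
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS blobs (blob_hash TEXT PRIMARY KEY, point_ids TEXT)"
        )

    def has_blob(self, blob_hash: str) -> bool:
        # Primary-key lookup: the microsecond-level check used in the dedup loop above
        row = self.conn.execute(
            "SELECT 1 FROM blobs WHERE blob_hash = ?", (blob_hash,)
        ).fetchone()
        return row is not None

    def add_blob(self, blob_hash: str, point_ids: List[str]) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO blobs VALUES (?, ?)",
            (blob_hash, json.dumps(point_ids)),
        )
        self.conn.commit()

    def build_from_vector_store(self, vector_store) -> None:
        """Seed the registry from vectors created by HEAD indexing (payloads already carry blob_hash)."""
        by_blob: dict = {}
        for point in vector_store.iter_points():  # assumed iteration API
            blob_hash = point.payload.get("blob_hash")
            if blob_hash:
                by_blob.setdefault(blob_hash, []).append(point.id)
        for blob_hash, point_ids in by_blob.items():
            self.add_blob(blob_hash, point_ids)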

Performance Expectations (42K files, 10GB repo)

First Run:

  • Estimated 150,000 unique blobs across history
  • 92% deduplication (most files unchanged across commits)
  • Only ~12,000 new blobs need embedding
  • 8 parallel threads with VoyageAI batch processing
  • Time: 4-7 minutes (similar to workspace indexing)

Incremental (New Commits):

  • Only process blobs from new commits
  • High deduplication (most files unchanged)
  • Time: <1 minute
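
A minimal sketch of how an incremental run could select only the new commits, assuming temporal_meta.json records the last indexed commit (the last_commit field matches the _save_temporal_metadata call later in this story).

import json
import subprocess
from pathlib import Path
from typing import List

def get_new_commits(codebase_dir: Path) -> List[str]:
    """Return commit hashes created since the last temporal indexing run, oldest first (sketch)."""
    meta_path = codebase_dir / ".code-indexer/index/temporal/temporal_meta.json"
    last_commit = None
    if meta_path.exists():
        last_commit = json.loads(meta_path.read_text()).get("last_commit")

    # Only commits after last_commit on the current branch; full history on first run
    rev_range = f"{last_commit}..HEAD" if last_commit else "HEAD"
    result = subprocess.run(
        ["git", "rev-list", "--reverse", rev_range],
        cwd=codebase_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.split()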

Bottlenecks:

  1. Git blob extraction (git cat-file) - slower than disk reads but parallelized
  2. VoyageAI API calls - same as workspace (token-aware batching, 120K limit)
  3. SQLite blob registry lookups - microseconds (indexed)

Technical Implementation

Entry Point (CLI)

# In cli.py index command
@click.option("--index-commits", is_flag=True,
              help="Index git commit history for temporal search (current branch only)")
@click.option("--all-branches", is_flag=True,
              help="Index all branches (requires --index-commits, may increase storage significantly)")
@click.option("--max-commits", type=int,
              help="Maximum number of commits to index per branch (default: all)")
@click.option("--since-date",
              help="Index commits since date (YYYY-MM-DD)")
def index(..., index_commits, all_branches, max_commits, since_date):
    if index_commits:
        # Lazy import for performance
        from src.code_indexer.services.temporal_indexer import TemporalIndexer

        temporal_indexer = TemporalIndexer(config_manager, vector_store)

        # Cost estimation and warning for all-branches
        if all_branches:
            cost_estimate = temporal_indexer.estimate_all_branches_cost()
            console.print(Panel(
                f"⚠️  [yellow]Indexing all branches will:[/yellow]\n"
                f"  • Process {cost_estimate.additional_commits:,} additional commits\n"
                f"  • Create {cost_estimate.additional_blobs:,} new embeddings\n"
                f"  • Use {cost_estimate.storage_mb:.1f} MB additional storage\n"
                f"  • Cost ~${cost_estimate.api_cost:.2f} in VoyageAI API calls",
                title="Cost Warning",
                border_style="yellow"
            ))

            if cost_estimate.total_branches > 50:
                if not click.confirm("Continue with all-branches indexing?", default=False):
                    console.print("[yellow]Cancelled. Using single-branch mode.[/yellow]")
                    all_branches = False

        result = temporal_indexer.index_commits(
            all_branches=all_branches,
            max_commits=max_commits,
            since_date=since_date
        )

Core Implementation (TemporalIndexer Orchestration)

CRITICAL: TemporalIndexer orchestrates the flow but delegates actual blob processing to HistoricalBlobProcessor.

class TemporalIndexer:
    def __init__(self, config_manager, vector_store):
        self.config = config_manager.get_config()
        self.vector_store = vector_store
        self.db_path = Path(".code-indexer/index/temporal/commits.db")

        # Initialize git-specific components
        self.blob_scanner = TemporalBlobScanner(self.config.codebase_dir)
        self.blob_reader = GitBlobReader(self.config.codebase_dir)

        # Initialize blob registry
        self.blob_registry = BlobRegistry(
            Path(".code-indexer/index/temporal/blob_registry.db")
        )

        # Initialize blob processor (reuses VectorCalculationManager)
        from .embedding_factory import EmbeddingProviderFactory
        embedding_provider = EmbeddingProviderFactory.create(config=self.config)

        self.blob_processor = HistoricalBlobProcessor(
            config=self.config,
            embedding_provider=embedding_provider,
            vector_store=vector_store,
            blob_registry=self.blob_registry
        )

    def index_commits(self, all_branches: bool = False,
                      max_commits: Optional[int] = None,
                      since_date: Optional[str] = None,
                      progress_callback: Optional[Callable] = None) -> IndexingResult:
        """Index git history with blob deduplication and branch tracking.

        This method orchestrates but DELEGATES blob processing to
        HistoricalBlobProcessor (which reuses VectorCalculationManager).
        """

        # Step 1: Build blob registry from existing vectors (SQLite)
        self.blob_registry.build_from_vector_store(self.vector_store)

        # Step 2: Get commit history from git (with branch info)
        commits = self._get_commit_history(all_branches, max_commits, since_date)
        current_branch = self._get_current_branch()

        # Step 3: Process each commit
        total_blobs_processed = 0
        total_vectors_created = 0

        for i, commit in enumerate(commits):
            # 3a. Discover all blobs in this commit (git ls-tree)
            all_blobs = self.blob_scanner.get_blobs_for_commit(commit.hash)

            # 3b. Filter to ONLY new blobs (deduplication check)
            new_blobs = []
            for blob_info in all_blobs:
                if not self.blob_registry.has_blob(blob_info.blob_hash):
                    new_blobs.append(blob_info)

            # 3c. Process new blobs using HistoricalBlobProcessor
            #     (This reuses VectorCalculationManager + FilesystemVectorStore)
            if new_blobs:
                stats = self.blob_processor.process_blobs_high_throughput(
                    new_blobs,
                    vector_thread_count=8,
                    progress_callback=progress_callback
                )
                total_vectors_created += stats.vectors_created

            total_blobs_processed += len(all_blobs)

            # 3d. Store commit metadata in SQLite (links commit → blobs)
            self._store_commit_tree(commit, all_blobs)

            # 3e. Store branch metadata for THIS COMMIT (CRITICAL: During processing)
            # DO NOT defer this to after the loop - branch metadata must be stored
            # as we process each commit for accuracy and to avoid expensive lookups later
            self._store_commit_branch_metadata(
                commit_hash=commit.hash,
                all_branches_mode=all_branches,
                current_branch=current_branch
            )

            # Progress with branch info
            if progress_callback:
                branch_info = f" [{current_branch}]" if not all_branches else ""
                progress_callback(
                    i + 1,
                    len(commits),
                    Path(f"commit {commit.hash[:8]}"),
                    info=f"{i+1}/{len(commits)} commits{branch_info}"
                )

        # Step 4: Save temporal metadata with branch info
        branch_stats = self._calculate_branch_statistics(commits, all_branches)
        self._save_temporal_metadata(
            last_commit=commits[-1].hash,
            total_commits=len(commits),
            total_blobs=total_blobs_processed,
            new_blobs=total_vectors_created // 3,  # Approx (3 chunks/file avg)
            branch_stats=branch_stats,
            indexing_mode='all-branches' if all_branches else 'single-branch'
        )

        return IndexingResult(
            total_commits=len(commits),
            unique_blobs=total_blobs_processed,
            new_blobs_indexed=total_vectors_created // 3,
            deduplication_ratio=1 - (total_vectors_created / (total_blobs_processed * 3)),
            branches_indexed=branch_stats.branches,
            commits_per_branch=branch_stats.per_branch_counts
        )
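
The private helpers called above (_get_current_branch, _get_commit_history, _store_commit_branch_metadata, and others) are not included in the story. A hedged sketch of three of them; the Commit dataclass, the commits_db connection, and the per-branch handling are assumptions.

import subprocess
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Commit:
    hash: str
    author: str
    date: str

class TemporalIndexer:  # continued: assumed shape of the private helpers used above
    def _get_current_branch(self) -> str:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=self.config.codebase_dir, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    def _get_commit_history(self, all_branches: bool,
                            max_commits: Optional[int],
                            since_date: Optional[str]) -> List[Commit]:
        """List commits to index, oldest first."""
        cmd = ["git", "log", "--reverse", "--format=%H|%an|%aI"]
        if all_branches:
            cmd.append("--all")  # walk every branch, not just HEAD
        if max_commits:
            cmd.append(f"--max-count={max_commits}")  # global cap; a per-branch cap would need per-ref walks
        if since_date:
            cmd.append(f"--since={since_date}")
        result = subprocess.run(cmd, cwd=self.config.codebase_dir,
                                capture_output=True, text=True, check=True)
        commits = []
        for line in result.stdout.strip().splitlines():
            commit_hash, author, date = line.split("|", 2)
            commits.append(Commit(hash=commit_hash, author=author, date=date))
        return commits

    def _store_commit_branch_metadata(self, commit_hash: str,
                                      all_branches_mode: bool,
                                      current_branch: str) -> None:
        """Record which branches a commit appears in (commit_branches table)."""
        if all_branches_mode:
            # git branch --contains lists every branch that can reach this commit
            result = subprocess.run(
                ["git", "branch", "--contains", commit_hash, "--format=%(refname:short)"],
                cwd=self.config.codebase_dir, capture_output=True, text=True, check=True,
            )
            branches = result.stdout.split()
        else:
            branches = [current_branch]
        self.commits_db.executemany(  # assumed sqlite3 connection opened in __init__
            "INSERT OR IGNORE INTO commit_branches VALUES (?, ?)",
            [(commit_hash, b) for b in branches],
        )
        self.commits_db.commit()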

New Component: TemporalBlobScanner

@dataclass
class BlobInfo:
    """Information about a blob in git history."""
    blob_hash: str      # Git's blob hash (for deduplication)
    file_path: str      # Relative path in repo
    commit_hash: str    # Which commit this blob appears in
    size: int           # Blob size in bytes

class TemporalBlobScanner:
    """Discovers blobs in git history."""

    def __init__(self, codebase_dir: Path):
        self.codebase_dir = codebase_dir

    def get_blobs_for_commit(self, commit_hash: str) -> List[BlobInfo]:
        """Get all blobs in a commit's tree.

        Uses: git ls-tree -r -l <commit_hash>
        """
        cmd = ["git", "ls-tree", "-r", "-l", commit_hash]
        result = subprocess.run(
            cmd,
            cwd=self.codebase_dir,
            capture_output=True,
            text=True,
            check=True
        )

        blobs = []
        for line in result.stdout.strip().split("\n"):
            if not line:
                continue

            # Format: <mode> <type> <hash> <size>\t<path>
            # Example: 100644 blob abc123def456 1234\tsrc/module.py
            parts = line.split()
            if len(parts) >= 4 and parts[1] == "blob":
                blob_hash = parts[2]
                size = int(parts[3])
                file_path = line.split("\t", 1)[1]

                blobs.append(BlobInfo(
                    blob_hash=blob_hash,
                    file_path=file_path,
                    commit_hash=commit_hash,
                    size=size
                ))

        return blobs

New Component: GitBlobReader

class GitBlobReader:
    """Reads blob content from git object store."""

    def __init__(self, codebase_dir: Path):
        self.codebase_dir = codebase_dir

    def read_blob_content(self, blob_hash: str) -> str:
        """Extract blob content as text.

        Uses: git cat-file blob <blob_hash>
        """
        cmd = ["git", "cat-file", "blob", blob_hash]
        result = subprocess.run(
            cmd,
            cwd=self.codebase_dir,
            capture_output=True,
            text=True
        )

        if result.returncode != 0:
            raise ValueError(f"Failed to read blob {blob_hash}: {result.stderr}")

        return result.stdout

New Component: HistoricalBlobProcessor

CRITICAL: This component reuses VectorCalculationManager and FilesystemVectorStore.

class HistoricalBlobProcessor:
    """Processes historical git blobs with parallel vectorization.

    Similar to HighThroughputProcessor but for git blobs instead of disk files.
    Reuses: VectorCalculationManager + FilesystemVectorStore
    """

    def __init__(self, config, embedding_provider, vector_store, blob_registry):
        self.config = config
        self.embedding_provider = embedding_provider
        self.vector_store = vector_store
        self.blob_registry = blob_registry
        self.blob_reader = GitBlobReader(config.codebase_dir)
        self.chunker = FixedSizeChunker()  # Will use chunk_text() method

    def process_blobs_high_throughput(
        self,
        blobs: List[BlobInfo],
        vector_thread_count: int,
        progress_callback: Optional[Callable] = None
    ) -> BlobProcessingStats:
        """Process blobs with parallel vectorization.

        Uses SAME architecture as HighThroughputProcessor:
        - VectorCalculationManager for parallel embeddings
        - FilesystemVectorStore for vector storage
        - ThreadPoolExecutor for parallel blob processing
        """
        stats = BlobProcessingStats()

        # ✅ REUSE VectorCalculationManager (unchanged)
        with VectorCalculationManager(
            self.embedding_provider, vector_thread_count
        ) as vector_manager:

            # Parallel blob processing
            with ThreadPoolExecutor(max_workers=vector_thread_count) as executor:
                futures = []
                for blob_info in blobs:
                    future = executor.submit(
                        self._process_single_blob,
                        blob_info,
                        vector_manager
                    )
                    futures.append((future, blob_info))

                # Collect results (iterating futures in submission order)
                for i, (future, blob_info) in enumerate(futures):
                    try:
                        result = future.result()
                        stats.blobs_processed += 1
                        stats.vectors_created += result.chunks_processed

                        # Progress callback
                        if progress_callback:
                            progress_callback(
                                i + 1,
                                len(blobs),
                                Path(blob_info.file_path),
                                info=f"{i+1}/{len(blobs)} blobs"
                            )
                    except Exception as e:
                        # Assumed handling (the story text is truncated at this point):
                        # count the failure and continue with the remaining blobs
                        stats.blobs_failed += 1

        return stats
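
_process_single_blob is referenced above but not included in the truncated story. A hedged sketch of what it could look like; BlobResult and the submit_chunk(), upsert(), and add_blob() method names are assumptions rather than the project's confirmed APIs.

from dataclasses import dataclass

@dataclass
class BlobResult:
    chunks_processed: int

class HistoricalBlobProcessor:  # continued: assumed shape of the per-blob worker
    def _process_single_blob(self, blob_info: BlobInfo, vector_manager) -> BlobResult:
        """Read, chunk, vectorize, and store one historical blob (sketch)."""
        # 1. Full blob content from the git object store (not a diff)
        content = self.blob_reader.read_blob_content(blob_info.blob_hash)

        # 2. Chunk the pre-loaded text via the new FixedSizeChunker.chunk_text()
        chunks = self.chunker.chunk_text(content, source=blob_info.file_path)

        # 3. Vectorize and store each chunk, tagging it with the blob hash
        point_ids = []
        for chunk in chunks:
            embedding = vector_manager.submit_chunk(chunk.text).result()  # assumed API
            point_id = self.vector_store.upsert(                          # assumed API
                vector=embedding,
                payload={
                    "blob_hash": blob_info.blob_hash,
                    "file_path": blob_info.file_path,
                    "commit_hash": blob_info.commit_hash,
                },
            )
            point_ids.append(point_id)

        # 4. Register the blob so later commits reuse these vectors (deduplication)
        self.blob_registry.add_blob(blob_info.blob_hash, point_ids)

        return BlobResult(chunks_processed=len(chunks))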