Epic: Temporal Git History Semantic Search
Epic Overview
Problem Statement:
AI coding agents (like Claude Code) need semantic search across git history to find removed code, understand pattern evolution, prevent regressions, and debug with historical context. Current tools (git blame, git log) only provide text-based or current-state views.
Target Users:
- Claude Code and other AI coding agents working on codebases
- Developers needing historical context for debugging
- Teams tracking code evolution patterns
Success Criteria:
- Semantic temporal queries working across git history
- Code evolution visualization with commit messages, diffs, and branch context
- 92%+ storage savings via git blob deduplication
- Query performance <300ms on 40K+ repos
- Backward compatibility maintained (space-only search unchanged)
- Cost-effective default behavior (single branch) with opt-in for complete multi-branch coverage
CRITICAL BLOB INDEXING MODEL:
We index each unique blob hash ONCE, regardless of how many commits reference it. The temporal database (commits.db) tracks which commits reference each blob through the trees table.
Example: If blob abc123 (containing user.py with authentication code) appears in 100 commits across history:
- Embeddings: Created ONCE for blob abc123 (12 chunks → 12 vectors)
- Temporal Links: 100 rows in the trees table linking commits to blob abc123
- Storage Savings: 92% reduction (1 embedding set vs 100 duplicates)
- Queries: All 100 commits can find this code via shared vectors
Why This Matters:
- Deduplication is critical: 42K file repo with 150K historical blobs → Only 12K unique blobs need embeddings
- Query correctness: Multiple commits correctly return same removed code
- Performance: 4-7 minutes vs 50+ hours without deduplication
- Cost: $50 vs $600+ in API calls without deduplication
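For illustration, once the commits.db schema defined under Storage Architecture below is in place, the shared-vector lookup reduces to a single indexed join (a minimal sketch; the blob hash is a placeholder):

```python
import sqlite3

# Illustrative lookup: every commit that references blob abc123 shares its vectors
conn = sqlite3.connect(".code-indexer/index/temporal/commits.db")
rows = conn.execute(
    """SELECT c.hash, c.date, c.message
       FROM trees t
       JOIN commits c ON c.hash = t.commit_hash
       WHERE t.blob_hash = ?
       ORDER BY c.date DESC""",
    ("abc123",),
).fetchall()
```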
Branch Strategy (Based on Analysis of Evolution Repository):
- Default: Index current branch only (development/main) - provides 91.6% commit coverage
- Opt-in: --all-branches flag for complete multi-branch history
- Rationale: Indexing all branches in large repos increases storage by ~85% for only ~8% more commits
- Flexibility: Track branch metadata for all indexed commits to support future branch-aware queries
Features
00_Story_BackgroundIndexRebuilding (PREREQUISITE)
Purpose: Establish foundational locking mechanism for atomic index updates
Story: Background Index Rebuilding with Atomic Swap
Status: MUST BE IMPLEMENTED FIRST - All subsequent stories depend on this infrastructure
Implementation Order: Story 0 → STOP for review → Story 1 (after approval)
01_Feat_TemporalIndexing
Purpose: Build and maintain temporal index of git history with branch awareness
Stories:
- 01_Story_GitHistoryIndexingWithBlobDedup: Index repository git history with deduplication and branch metadata
- 02_Story_IncrementalIndexingWithWatch: Incremental indexing with watch mode integration
- 03_Story_SelectiveBranchIndexing: Selective branch indexing with pattern matching
02_Feat_TemporalQueries
Purpose: Enable semantic search across git history with temporal filters
Stories:
- 01_Story_TimeRangeFiltering: Query with time-range filtering
- 02_Story_PointInTimeQuery: Query at specific commit
03_Feat_CodeEvolutionVisualization
Purpose: Display code evolution with diffs and commit context
Stories:
- 01_Story_EvolutionDisplayWithCommitContext: Display evolution timeline with diffs
04_Feat_APIServerTemporalRegistration
Purpose: Enable golden repository registration with temporal indexing via API
Stories:
- 01_Story_GoldenRepoRegistrationWithTemporal: Admin registers golden repos with enable_temporal flag
05_Feat_APIServerTemporalQuery
Purpose: Enable temporal search via API with time-range, point-in-time, and evolution queries
Stories:
- 01_Story_TemporalQueryParametersViaAPI: Users query with temporal parameters via POST /api/query
Technical Architecture
Storage Architecture
SQLite Database (.code-indexer/index/temporal/commits.db):
CREATE TABLE commits (
hash TEXT PRIMARY KEY,
date INTEGER NOT NULL,
author_name TEXT,
author_email TEXT,
message TEXT,
parent_hashes TEXT
);
CREATE TABLE trees (
commit_hash TEXT NOT NULL,
file_path TEXT NOT NULL,
blob_hash TEXT NOT NULL,
PRIMARY KEY (commit_hash, file_path),
FOREIGN KEY (commit_hash) REFERENCES commits(hash)
);
-- NEW: Branch metadata tracking
CREATE TABLE commit_branches (
commit_hash TEXT NOT NULL,
branch_name TEXT NOT NULL,
is_head INTEGER DEFAULT 0, -- 1 if this was HEAD when indexed
indexed_at INTEGER NOT NULL, -- Unix timestamp
PRIMARY KEY (commit_hash, branch_name),
FOREIGN KEY (commit_hash) REFERENCES commits(hash)
);
-- Performance indexes for 40K+ repos
CREATE INDEX idx_trees_blob_commit ON trees(blob_hash, commit_hash);
CREATE INDEX idx_commits_date_hash ON commits(date, hash);
CREATE INDEX idx_trees_commit ON trees(commit_hash);
CREATE INDEX idx_commit_branches_hash ON commit_branches(commit_hash);
CREATE INDEX idx_commit_branches_name ON commit_branches(branch_name);
Blob Registry (.code-indexer/index/temporal/blob_registry.db):
- SQLite database mapping blob_hash → point_ids from existing vectors
- Required for large-scale repos (40K+ files, 10GB+ with history)
- Indexed for fast lookups (microseconds per blob)
- Lazy connection with result caching
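One plausible shape for that mapping (illustrative only; the epic specifies the blob_hash → point_ids mapping but not exact column names):

```sql
-- Hypothetical blob registry schema: maps each blob to its existing vector point IDs
CREATE TABLE IF NOT EXISTS blob_points (
    blob_hash TEXT NOT NULL,
    point_id  TEXT NOT NULL,
    PRIMARY KEY (blob_hash, point_id)
);
CREATE INDEX IF NOT EXISTS idx_blob_points_hash ON blob_points(blob_hash);
```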
Temporal Metadata (.code-indexer/index/temporal/temporal_meta.json):
- Tracks last_indexed_commit, indexing state, statistics
- Records indexed_branches (list of branch names/patterns)
- Stores indexing_mode ('single-branch' or 'all-branches')
- Branch-specific stats: commits_per_branch, deduplication_ratio
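An illustrative temporal_meta.json consistent with the fields above (values are examples, not measurements):

```json
{
  "last_indexed_commit": "a1b2c3d4e5f6",
  "indexing_state": "complete",
  "indexing_mode": "single-branch",
  "indexed_branches": ["development"],
  "stats": {
    "commits_per_branch": { "development": 81733 },
    "deduplication_ratio": 0.92
  }
}
```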
Component Architecture
New Components:
- TemporalIndexer - Orchestrates git history indexing (mode-agnostic)
- TemporalBlobScanner - Discovers blobs in git history via git ls-tree
- GitBlobReader - Reads blob content from git object store via git cat-file
- HistoricalBlobProcessor - Parallel blob processing (analogous to HighThroughputProcessor)
- TemporalSearchService - Handles temporal queries (mode-agnostic)
- TemporalFormatter - Formats temporal results with Rich
Integration Points:
- CLI: New flags for index and query commands
- Config: enable_temporal setting
- Watch Mode: Maintains temporal index if enabled
- Daemon Mode: Automatic delegation when daemon.enabled: true
Progress Callback Specification (CRITICAL)
Standard Signature (All Modes):
def progress_callback(
current: int,
total: int,
path: Path,
info: str = ""
) -> None:
"""
Universal progress callback for indexing operations.
Args:
current: Current progress count (files, blobs, commits processed)
total: Total count (0 for setup messages, >0 for progress bar)
path: Path being processed (file path or empty Path("") for setup)
info: Formatted progress string (specific format required for CLI)
CLI Format Requirements:
- Setup messages (total=0): info="Setup message text"
Triggers ℹ️ scrolling display
- File progress (total>0): info="X/Y files (%) | emb/s | threads | filename"
Triggers progress bar with metrics display
- CRITICAL: Do not change format without updating cli.py progress_callback logic
Daemon Mode Requirements:
- Must be RPyC-serializable (primitives only: int, str, Path)
- No complex objects (no Path operations during callback)
- Callback executed in daemon process, results streamed to client
Thread Safety Requirements:
- Callback MUST be thread-safe (called from multiple worker threads)
- Use locks for any shared state updates
- Keep callback execution fast (<1ms) to avoid blocking workers
"""Temporal Indexing Usage:
# Setup phase (total=0 triggers ℹ️ display)
progress_callback(0, 0, Path(""), info="Scanning git history...")
progress_callback(0, 0, Path(""), info="Found 40,123 commits to index")
progress_callback(0, 0, Path(""), info="Deduplicating blobs (92% expected savings)...")
# Blob processing phase (total>0 triggers progress bar)
for i, blob in enumerate(blobs_to_process):
# Format: "X/Y blobs (%) | emb/s | threads | blob_description"
info = f"{i+1}/{total} blobs ({percent}%) | {emb_per_sec:.1f} emb/s | {threads} threads | {blob.tree_path}"
    progress_callback(i+1, total, Path(blob.tree_path), info=info)
RPyC Serialization (Daemon Mode):
# CORRECT: Simple types serialize over RPyC
progress_callback(
current=42, # int: serializable ✅
total=1000, # int: serializable ✅
path=Path("src/file.py"), # Path: serializable ✅
info="42/1000 files (4%)" # str: serializable ✅
)
# WRONG: Complex objects fail serialization
progress_callback(
current=42,
total=1000,
path=Path("src/file.py"),
info={"files": 42, "total": 1000} # dict: NOT serializable ❌
)
Correlation IDs (Future Enhancement):
When implementing multi-operation tracking, consider adding correlation IDs:
def progress_callback(
current: int,
total: int,
path: Path,
info: str = "",
correlation_id: Optional[str] = None # Links related progress updates
) -> None:
"""Correlation ID enables ordering progress from concurrent operations."""Thread Safety Pattern:
import threading

class TemporalIndexer:
def __init__(self, progress_callback):
self.progress_callback = progress_callback
self.callback_lock = threading.Lock() # Protect callback invocation
self.progress_cache = {} # Cache for concurrent_files display
def _report_progress(self, current, total, path, info):
"""Thread-safe progress reporting."""
with self.callback_lock:
            self.progress_callback(current, total, path, info)
Performance Requirements:
- Callback execution: <1ms (avoid blocking worker threads)
- Call frequency: ~10-50 per second during active processing
- Network overhead (daemon): ~10-20ms latency for RPC round-trip
- Total progress overhead: <5% of processing time
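One way to stay inside this budget is to throttle at the producer; a minimal sketch built on the _report_progress helper above (_maybe_report and _last_report are assumptions, not part of the spec):

```python
import time

# Throttled reporting: at most ~10 callbacks/sec per worker, always emit the final one
def _maybe_report(self, current, total, path, info):
    now = time.monotonic()
    if current == total or now - getattr(self, "_last_report", 0.0) >= 0.1:
        self._report_progress(current, total, path, info)
        self._last_report = now
```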
Indexing Pipeline Reuse Strategy (CRITICAL)
What We Index: Full Blob Versions, NOT Diffs
Git blobs represent complete file versions at specific points in time. We index full blob content for semantic search, not diffs:
✅ CORRECT: Index full blob content
Commit abc123: user.py (blob def456)
→ Complete file: class User with all methods and context
❌ WRONG: Index diffs
Commit abc123: user.py diff
→ Partial: +def greet(): ... (no class context)
Why Full Blobs:
- Semantic search requires full context - can't find code patterns in partial diffs
- Users want complete implementations - "find removed authentication code" needs full class
- Git already stores blobs efficiently - compression + deduplication is built-in
- Deduplication works better - same content across commits = same blob hash = reuse vectors
Pipeline Component Reuse (60-65% Reuse Rate):
Reality Check: While the core embedding/storage pipeline is highly reusable, adapting it for git blob processing requires more new code than initially estimated. The breakdown below reflects realistic implementation complexity.
✅ Fully Reusable (~40% of total implementation):
- VectorCalculationManager - Takes text chunks → embeddings (source-agnostic, zero changes)
- FilesystemVectorStore - Writes vector JSON files (already supports blob_hash in metadata)
- Threading infrastructure - ThreadPoolExecutor, CleanSlotTracker (reusable patterns)
🔧 Requires Modification (~25% of total implementation):
- FixedSizeChunker - Already has chunk_text(text, file_path) method, but needs blob-specific metadata handling
- HighThroughputProcessor - Core patterns reusable, but needs adaptation for a blob queue instead of a file queue
- Progress callback mechanism - Signature compatible, but needs blob-specific tracking (commit hash, blob count)
🆕 New Git-Specific Components (~35% of total implementation):
- TemporalIndexer - Orchestrates entire temporal indexing workflow (new coordinator)
- TemporalBlobScanner - Discovers blobs via git ls-tree (replaces FileFinder's disk walking)
- GitBlobReader - Reads blob content via git cat-file (replaces file I/O)
- HistoricalBlobProcessor - Manages blob queue and parallel processing (adapts HighThroughputProcessor patterns)
- TemporalSearchService - Handles temporal queries with SQLite filtering (new query layer)
- TemporalFormatter - Formats temporal results with Rich output (new display logic)
Adaptation Complexity:
- File → Blob Translation: Blobs have no filesystem path (use git object references)
- Metadata Differences: Blob hash, commit hash, tree path vs file path, line numbers
- Git Subprocess Integration: git ls-tree, git cat-file, git log performance tuning
- SQLite Coordination: Blob registry, commit metadata, branch tracking integration
- Memory Management: 12K blob processing requires careful memory handling vs file-by-file
Architecture Comparison:
Workspace Indexing (HEAD):
Disk Files → FileIdentifier → FixedSizeChunker
→ VectorCalculationManager → FilesystemVectorStore
Git History Indexing (Temporal):
Git Blobs → GitBlobReader → FixedSizeChunker.chunk_text()
→ VectorCalculationManager → FilesystemVectorStore
↑ ↑
SAME COMPONENTS REUSED
Deduplication Strategy:
# For each commit
for commit in commits:
# 1. Get all blobs in commit (git ls-tree)
all_blobs = scanner.get_blobs_for_commit(commit) # 150 blobs
# 2. Check blob registry (SQLite lookup, microseconds)
new_blobs = [b for b in all_blobs if not registry.has_blob(b.hash)]
# Result: ~12 new blobs (92% already have vectors)
# 3. Process ONLY new blobs (reuse existing for rest)
if new_blobs:
processor.process_blobs_high_throughput(new_blobs)
        # Uses VectorCalculationManager + FilesystemVectorStore
Performance Expectations (Repository Size Matters):
CRITICAL: Indexing time scales with (commits × files/commit). Larger repos take longer.
Benchmarked on Evolution Repo (89K commits, 27K files/commit, 9.2GB):
- git log: 50,000+ commits/sec (extremely fast)
- git ls-tree: 19 commits/sec, 52.7ms/commit (bottleneck)
- git cat-file --batch: 419-869 blobs/sec, 1.2-2.4ms/blob (excellent)
- Actual deduplication: 99.9% (better than 92% estimate)
Timing by Repository Size:
| Repo Size | Files/Commit | Commits | Unique Blobs | Indexing Time | Bottleneck |
|---|---|---|---|---|---|
| Small | 1-5K | 10-20K | 2-5K | 4-10 min | git ls-tree (9-18 min) |
| Medium | 5-10K | 40K | 12-16K | 30-45 min | git ls-tree (~35 min) |
| Large | 20K+ | 80K+ | 20-30K | 60-90 min | git ls-tree (~70 min) |
Component Breakdown (40K commit medium repo):
- git log (40K commits): <1 min
- git ls-tree (40K commits): 35 min ⚠️ BOTTLENECK (80% of time)
- git cat-file (12K blobs): <1 min
- Embedding generation (144K chunks): 3 min
- SQLite operations: 3 min
Incremental Indexing (All Sizes):
- Only new commit blobs → <1 minute ✅
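A minimal sketch of the incremental walk (variable names assumed; last_indexed_commit comes from temporal_meta.json):

```python
import subprocess

# List only the commits added since the last indexed commit
new_commits = subprocess.run(
    ["git", "rev-list", f"{last_indexed_commit}..HEAD"],
    capture_output=True, text=True, check=True, cwd=repo_path,
).stdout.split()
# Only blobs introduced by these commits need new embeddings → typically <1 minute
```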
Key Insights:
- ✅ git cat-file is FAST (no optimization needed)
- ⚠️ git ls-tree scales with repo size (fundamental git limitation)
- ✅ Deduplication works BETTER than expected (99.9% vs 92%)
- ⚠️ Initial indexing time varies widely by repo size
- ✅ Incremental updates are fast regardless of repo size
Progress Reporting Strategy:
Since git ls-tree consumes 80%+ of time, progress must show:
- "Processing commit X/Y" (commit-level progress)
- Commits/sec rate and ETA
- Clear indication this is normal (not stuck)
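A sketch of commit-level reporting with rate and ETA (scan_commit_trees is a hypothetical wrapper around git ls-tree; the callback follows the specification above):

```python
import time
from pathlib import Path

# Commit-level progress during the git ls-tree phase (the known bottleneck)
start = time.monotonic()
for i, commit in enumerate(commits):
    scan_commit_trees(commit)  # hypothetical: runs `git ls-tree` for one commit
    rate = (i + 1) / max(time.monotonic() - start, 1e-6)
    eta_min = (len(commits) - i - 1) / rate / 60
    progress_callback(
        i + 1, len(commits), Path(""),
        info=f"{i + 1}/{len(commits)} commits ({(i + 1) * 100 // len(commits)}%) | "
             f"{rate:.1f} commits/s | ETA {eta_min:.0f} min | scanning trees (normal)",
    )
```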
See Analysis Documents:
- .analysis/temporal_indexing_pipeline_reuse_strategy.md - Complete implementation guide
- .analysis/temporal_blob_registry_sqlite_decision.md - SQLite rationale
- .tmp/git_performance_final_analysis.md - Evolution repo benchmarks (Nov 2, 2025)
Memory Management Strategy (CRITICAL)
Problem: Processing 12K unique blobs requires careful memory management to avoid OOM conditions on large repositories.
Blob Size Reality Check:
- Typical blob sizes: 50KB-500KB per file (median ~100KB)
- 12K blobs in memory: 1.2GB-6GB total (uncompressed)
- With chunking overhead: ~2-8GB peak memory
- Risk: Loading all blobs at once → OOM on systems with <16GB RAM
Streaming Batch Processing Strategy:
import concurrent.futures
import gc
from pathlib import Path
from typing import List

class HistoricalBlobProcessor:
"""Process blobs in memory-efficient batches."""
BATCH_SIZE = 500 # Process 500 blobs at a time
MAX_BATCH_MEMORY_MB = 512 # Target 512MB per batch
def process_blobs_in_batches(self, blob_hashes: List[str]):
"""
Stream blobs in batches to avoid OOM.
Memory Profile:
- 500 blobs × ~100KB avg = 50MB blob content
- Chunking overhead: ~2x (100MB for chunks)
- Embedding queue: ~3x (300MB for vectors)
- Peak: ~450MB per batch (safe for 4GB+ systems)
"""
for batch_start in range(0, len(blob_hashes), self.BATCH_SIZE):
batch_end = min(batch_start + self.BATCH_SIZE, len(blob_hashes))
batch = blob_hashes[batch_start:batch_end]
# 1. Read batch (streaming from git)
blob_contents = self._read_blobs_batch(batch)
# 2. Chunk batch
all_chunks = []
for blob_hash, content in blob_contents.items():
chunks = self.chunker.chunk_text(content, Path(blob_hash))
all_chunks.extend(chunks)
# 3. Generate embeddings (VectorCalculationManager)
embedding_futures = []
for chunk in all_chunks:
future = self.vector_manager.submit_batch_task([chunk["text"]], chunk)
embedding_futures.append(future)
# 4. Store vectors (FilesystemVectorStore)
for future in concurrent.futures.as_completed(embedding_futures):
result = future.result()
self.vector_store.upsert_points(collection_name, [result])
# 5. FREE MEMORY: Clear batch data before next iteration
del blob_contents, all_chunks, embedding_futures
gc.collect() # Force garbage collection between batches
# 6. Update progress
            self.progress_callback(batch_end, len(blob_hashes), Path(""), info=f"Batch {batch_end}/{len(blob_hashes)}")
Batch Size Selection:
| Batch Size | Memory Usage | Processing Time | Tradeoffs |
|---|---|---|---|
| 100 blobs | ~100MB peak | Slower (more batches) | Safe for 2GB systems |
| 500 blobs | ~450MB peak | Balanced | RECOMMENDED (4GB+ systems) |
| 1000 blobs | ~900MB peak | Faster (fewer batches) | Requires 8GB+ systems |
| 5000 blobs | ~4.5GB peak | Fastest | Risk: OOM on 8GB systems |
Decision: Default 500 blobs per batch (safe for 4GB+ systems, typical developer machines)
Dynamic Batch Sizing (Future Enhancement):
def _calculate_batch_size(self, available_memory_mb: int) -> int:
"""Adjust batch size based on available system memory."""
if available_memory_mb < 4096:
return 100 # Conservative for 2-4GB systems
elif available_memory_mb < 8192:
return 500 # Balanced for 4-8GB systems
else:
        return 1000  # Aggressive for 8GB+ systems
OOM Prevention Mechanisms:
1. Memory Monitoring:
import psutil
def _check_memory_before_batch(self):
"""Verify sufficient memory before starting batch."""
memory = psutil.virtual_memory()
available_mb = memory.available / (1024 ** 2)
if available_mb < 1024: # Less than 1GB available
logger.warning(f"Low memory: {available_mb:.0f}MB available, reducing batch size")
self.BATCH_SIZE = max(50, self.BATCH_SIZE // 2)
if available_mb < 512: # Critical: less than 512MB
        raise MemoryError(f"Insufficient memory: {available_mb:.0f}MB available, cannot proceed safely")
2. Streaming Git Blob Reads:
import subprocess
from typing import Dict, Iterator, List

def _read_blobs_batch(self, blob_hashes: List[str]) -> Iterator[Dict[str, str]]:
    """Stream blob content from git in batches (generator: yields dicts of blob_hash → text)."""
results = {}
# Use git cat-file --batch for efficient streaming
with subprocess.Popen(
["git", "cat-file", "--batch"],
stdin=subprocess.PIPE,
stdout=subprocess.PIPE,
cwd=self.repo_path
) as proc:
for blob_hash in blob_hashes:
proc.stdin.write(f"{blob_hash}\n".encode())
proc.stdin.flush()
# Read blob header
header = proc.stdout.readline().decode()
if "missing" in header:
continue
# Read blob content (only this blob in memory)
size = int(header.split()[2])
            content = proc.stdout.read(size).decode(errors='ignore')
            proc.stdout.read(1)  # Consume the trailing LF git appends after each object
results[blob_hash] = content
# Immediately process if memory-constrained
if len(results) >= self.BATCH_SIZE:
yield results
results.clear()
if results:
        yield results
3. Explicit Memory Cleanup:
def _cleanup_batch_memory(self):
"""Aggressively free memory after batch processing."""
# Clear any cached data
self.blob_cache.clear()
self.chunk_cache.clear()
# Force Python garbage collection
import gc
gc.collect()
# Log memory status
memory = psutil.virtual_memory()
logger.debug(f"Post-batch memory: {memory.available / (1024**2):.0f}MB available")SQLite Memory Configuration:
-- Limit SQLite memory usage
PRAGMA cache_size = -2000; -- 2MB cache (negative value = KiB; -2000 is SQLite's default)
PRAGMA temp_store = MEMORY; -- Keep temp tables in memory (faster)
PRAGMA mmap_size = 268435456; -- 256MB memory-mapped I/O limit
Memory Budget Allocation (Total: 4GB System):
| Component | Memory Budget | Notes |
|---|---|---|
| Blob batch content | 50MB | 500 blobs × 100KB avg |
| Chunking overhead | 100MB | 2x content for chunk processing |
| Embedding queue | 300MB | 3x for vector calculation |
| SQLite databases | 50MB | Blob registry + commits.db |
| FilesystemVectorStore writes | 100MB | JSON file writes |
| Python overhead | 200MB | Interpreter, libraries |
| OS buffer cache | 1GB | Git operations, file I/O |
| Safety margin | 2.2GB | Available for other processes |
| Total | 4GB | Safe for typical developer machines |
Validation Strategy:
def test_memory_usage_under_load():
"""Integration test: verify memory stays within bounds."""
import tracemalloc
tracemalloc.start()
# Process 12K blobs
processor.process_blobs_in_batches(blob_hashes)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
# Assert: Peak memory < 1GB (safe margin)
    assert peak < 1024 * 1024 * 1024, f"Peak memory {peak/(1024**2):.0f}MB exceeds 1GB limit"
Configuration Options:
# .code-indexer/config.json (temporal section)
{
  "temporal": {
    "batch_size": 500,                  // Blobs per batch
    "max_batch_memory_mb": 512,         // Memory limit per batch
    "enable_memory_monitoring": true,   // Check available memory before batches
    "force_gc_between_batches": true    // Garbage collect after each batch
  }
}
Performance Impact:
- Streaming batches: +10-15% processing time vs loading all blobs
- Memory safety: Prevents OOM crashes
- Scalability: Works on 4GB systems, scales to 16GB+ systems
- Trade-off: Slightly slower, but reliable on typical hardware
Repository Lifecycle Integration
CRITICAL: Temporal indexing happens in GOLDEN REPOSITORIES with CoW inheritance to activated repos.
Architecture Overview:
1. Golden Repository (~/.cidx-server/data/golden-repos/<alias>/):
   - Admin registers repo via API: POST /api/register with index_types selection
   - Golden repo is cloned/indexed ONCE with selected index types
   - Temporal indexing: cidx index --index-commits in golden repo
   - All indexes stored: Semantic (FilesystemVectorStore), FTS (Tantivy), Temporal (SQLite commits.db + blob_registry.db)
   - Result: Complete multi-index golden repository ready for activation
2. Copy-on-Write (CoW) Inheritance (activated repos):
- User activates repo: Gets hardlink copy of ALL index data
- SQLite databases (commits.db, blob_registry.db) → CoW copied
- JSON chunk files (.code-indexer/index/) → CoW copied
- HNSW binary indexes → CoW copied
- FTS Tantivy indexes → CoW copied
- NO re-indexing required, instant activation
3. No Containers for Vector Storage:
- CRITICAL: Qdrant is legacy, NOT used anymore
- FilesystemVectorStore: Pure JSON files, no containers
- Temporal SQLite: Pure database files, no containers
- FTS Tantivy: Pure index files, no containers
- Only containers: Qdrant (legacy, unused), data-cleaner (optional)
API Server Index Class Selection:
When registering golden repos via POST /api/register, specify which index types to create:
{
"repo_url": "https://github.com/user/repo.git",
"index_types": ["semantic", "fts", "temporal"], // NEW: Select index classes
"temporal_options": { // NEW: Temporal-specific configuration
"all_branches": false, // Default: current branch only
"branch_patterns": ["main", "develop"], // Alternative: specific patterns
"max_commits": null, // Optional: limit history depth
"since_date": null // Optional: index commits after date
}
}
Index Types:
"semantic": Default FilesystemVectorStore with HNSW (always included)"fts": Full-text search via Tantivy (optional, fast exact text matching)"temporal": Git history indexing with temporal queries (optional, this epic)
Combinations Allowed:
["semantic"]- Just vector search (default, minimal)["semantic", "fts"]- Vector + full-text search["semantic", "temporal"]- Vector + git history["semantic", "fts", "temporal"]- All three (comprehensive)
Implementation Notes:
- Registration becomes async job (long-running temporal indexing)
- Job status API: GET /api/job/{job_id} for progress tracking
- Golden repo stores index type metadata in config
- Activated repos inherit all index types from golden
- Users query using the same POST /api/query with temporal parameters
Why This Matters:
- Scalability: Index once in golden, share via CoW to thousands of users
- Cost: Expensive temporal indexing (4-7 min, $50 API cost) happens ONCE
- Flexibility: Users choose which index types they need upfront
- Consistency: All activated repos have identical search capabilities
API Server Async Job Queue Architecture
CRITICAL: Long-running indexing operations MUST be async with job tracking.
Problem:
- Temporal indexing takes 4-7 minutes for large repos
- HTTP clients timeout waiting for synchronous response
- Users need progress visibility
- Multiple concurrent indexing jobs must be supported
Solution: Background Job Queue with Status API
Job Queue Implementation
# In src/code_indexer/server/job_queue.py
import threading
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional, Dict, Any, Callable
from queue import Empty, Queue
class JobStatus(Enum):
"""Job lifecycle states."""
QUEUED = "queued" # Waiting in queue
RUNNING = "running" # Currently executing
COMPLETED = "completed" # Finished successfully
FAILED = "failed" # Failed with error
CANCELLED = "cancelled" # User cancelled
@dataclass
class Job:
"""Background job representation."""
id: str
type: str # "index_repository", "temporal_index", "rebuild_hnsw"
status: JobStatus
created_at: datetime
started_at: Optional[datetime] = None
completed_at: Optional[datetime] = None
progress: Dict[str, Any] = field(default_factory=dict)
result: Optional[Any] = None
error: Optional[str] = None
metadata: Dict[str, Any] = field(default_factory=dict)
class JobQueue:
"""
Background job queue for long-running operations.
Architecture:
- Single background worker thread processes jobs serially
- Jobs execute one at a time (prevents resource contention)
- Job status tracked in memory (no persistence needed)
- Progress callbacks update job.progress dictionary
"""
def __init__(self):
self.jobs: Dict[str, Job] = {} # job_id -> Job
self.queue: Queue = Queue()
self.worker_thread: Optional[threading.Thread] = None
self.running = False
def start(self):
"""Start background worker thread."""
if self.running:
return
self.running = True
self.worker_thread = threading.Thread(
target=self._worker_loop,
daemon=True,
name="JobQueueWorker"
)
self.worker_thread.start()
def stop(self):
"""Stop background worker thread."""
self.running = False
if self.worker_thread:
self.worker_thread.join(timeout=5)
def submit_job(
self,
job_type: str,
task_callable: Callable,
metadata: Dict[str, Any] = None
) -> str:
"""
Submit job to queue.
Args:
job_type: Job type identifier
task_callable: Function to execute (receives job for progress updates)
metadata: Optional metadata (repo_url, index_types, etc.)
Returns:
Job ID for status tracking
"""
job_id = str(uuid.uuid4())
job = Job(
id=job_id,
type=job_type,
status=JobStatus.QUEUED,
created_at=datetime.now(),
metadata=metadata or {}
)
self.jobs[job_id] = job
self.queue.put((job_id, task_callable))
return job_id
def get_job_status(self, job_id: str) -> Optional[Job]:
"""Get current job status."""
return self.jobs.get(job_id)
def cancel_job(self, job_id: str) -> bool:
"""
Cancel job (if not started yet).
Returns True if cancelled, False if already running/completed.
"""
job = self.jobs.get(job_id)
if not job:
return False
if job.status == JobStatus.QUEUED:
job.status = JobStatus.CANCELLED
job.completed_at = datetime.now()
return True
return False # Can't cancel running/completed jobs
def _worker_loop(self):
"""Background worker loop - processes jobs serially."""
import logging
logger = logging.getLogger(__name__)
while self.running:
try:
                # Get next job (blocks with 1s timeout so shutdown stays responsive)
                try:
                    job_id, task_callable = self.queue.get(timeout=1)
                except Empty:
                    continue  # Queue empty; loop and re-check self.running
job = self.jobs.get(job_id)
if not job or job.status == JobStatus.CANCELLED:
continue
# Update job status
job.status = JobStatus.RUNNING
job.started_at = datetime.now()
logger.info(f"Job {job_id} ({job.type}): Started")
try:
# Execute task (pass job for progress updates)
result = task_callable(job)
# Mark completed
job.status = JobStatus.COMPLETED
job.result = result
job.completed_at = datetime.now()
logger.info(f"Job {job_id} ({job.type}): Completed")
except Exception as e:
# Mark failed
job.status = JobStatus.FAILED
job.error = str(e)
job.completed_at = datetime.now()
logger.error(f"Job {job_id} ({job.type}): Failed - {e}")
            except Exception as e:
                if self.running:  # Unexpected worker error (Empty timeouts handled above)
                    logger.error(f"Worker loop error: {e}")
API Endpoint Integration
# In src/code_indexer/server/api.py
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
from typing import List, Optional
from .job_queue import JobQueue, JobStatus
app = FastAPI()
job_queue = JobQueue()
# Start job queue on server startup
@app.on_event("startup")
async def startup_event():
job_queue.start()
@app.on_event("shutdown")
async def shutdown_event():
job_queue.stop()
class RepositoryRegistration(BaseModel):
"""Request body for repository registration."""
repo_url: str
index_types: List[str] = ["semantic"] # semantic, fts, temporal
temporal_options: Optional[dict] = None
class JobResponse(BaseModel):
"""Job submission response."""
job_id: str
status: str
message: str
@app.post("/api/register", response_model=JobResponse)
async def register_repository(request: RepositoryRegistration):
"""
Register repository for indexing (async).
Returns immediately with job_id for status tracking.
Actual indexing happens in background.
"""
def index_task(job):
"""Background indexing task."""
from ...services.smart_indexer import SmartIndexer
from ...services.temporal_indexer import TemporalIndexer
# Progress callback updates job.progress
def progress_callback(current, total, file_path, info=""):
job.progress = {
"current": current,
"total": total,
"file_path": str(file_path),
"info": info,
"percent": int((current / total * 100)) if total > 0 else 0
}
# Clone repository
job.progress = {"status": "Cloning repository..."}
golden_repo_path = clone_to_golden(request.repo_url)
# Index based on selected types
results = {}
if "semantic" in request.index_types:
job.progress = {"status": "Indexing semantic vectors..."}
smart_indexer = SmartIndexer(config_manager)
smart_indexer.index_repository(
repo_path=str(golden_repo_path),
progress_callback=progress_callback
)
results["semantic"] = "completed"
if "fts" in request.index_types:
job.progress = {"status": "Building FTS index..."}
# FTS indexing...
results["fts"] = "completed"
if "temporal" in request.index_types:
job.progress = {"status": "Indexing git history (4-7 min)..."}
temporal_indexer = TemporalIndexer(
repo_path=golden_repo_path,
config=config
)
            temporal_opts = request.temporal_options or {}  # Guard: may be None
            temporal_indexer.index_commits(
                all_branches=temporal_opts.get("all_branches", False),
                branch_patterns=temporal_opts.get("branch_patterns"),
                progress_callback=progress_callback
            )
results["temporal"] = "completed"
return results
# Submit job to queue
job_id = job_queue.submit_job(
job_type="index_repository",
task_callable=index_task,
metadata={
"repo_url": request.repo_url,
"index_types": request.index_types
}
)
return JobResponse(
job_id=job_id,
status="queued",
message=f"Repository indexing queued. Track progress at GET /api/job/{job_id}"
)
@app.get("/api/job/{job_id}")
async def get_job_status(job_id: str):
"""
Get job status and progress.
Returns:
- status: queued, running, completed, failed, cancelled
- progress: current task info (updated in real-time)
- result: final result (if completed)
- error: error message (if failed)
"""
job = job_queue.get_job_status(job_id)
if not job:
raise HTTPException(status_code=404, detail="Job not found")
return {
"job_id": job.id,
"type": job.type,
"status": job.status.value,
"created_at": job.created_at.isoformat(),
"started_at": job.started_at.isoformat() if job.started_at else None,
"completed_at": job.completed_at.isoformat() if job.completed_at else None,
"progress": job.progress,
"result": job.result,
"error": job.error,
"metadata": job.metadata
}
@app.delete("/api/job/{job_id}")
async def cancel_job(job_id: str):
"""Cancel queued job (cannot cancel running jobs)."""
success = job_queue.cancel_job(job_id)
if not success:
job = job_queue.get_job_status(job_id)
if not job:
raise HTTPException(status_code=404, detail="Job not found")
raise HTTPException(
status_code=400,
detail=f"Cannot cancel job in status: {job.status.value}"
)
return {"message": "Job cancelled successfully"}Client Usage Pattern
# Example: Client polling for job completion
import requests
import time
# Submit indexing job
response = requests.post("http://localhost:8000/api/register", json={
"repo_url": "https://github.com/user/repo.git",
"index_types": ["semantic", "temporal"],
"temporal_options": {
"all_branches": False
}
})
job_id = response.json()["job_id"]
print(f"Job submitted: {job_id}")
# Poll for completion
while True:
status_response = requests.get(f"http://localhost:8000/api/job/{job_id}")
status = status_response.json()
print(f"Status: {status['status']}")
if status['progress']:
print(f"Progress: {status['progress']}")
if status['status'] in ['completed', 'failed', 'cancelled']:
break
time.sleep(2) # Poll every 2 seconds
if status['status'] == 'completed':
print(f"Success! Result: {status['result']}")
elif status['status'] == 'failed':
print(f"Failed: {status['error']}")Job Queue Characteristics
Why Single Worker Thread:
- Prevents resource contention (CPU, memory, API quota)
- Simplifies reasoning (one job at a time)
- Temporal indexing is CPU/memory intensive
- VoyageAI API has rate limits
- Serial execution = predictable behavior
Job Retention:
- Jobs kept in memory (no persistence)
- Job history cleared on server restart
- Acceptable: Registration is infrequent operation
- Enhancement: Add job expiration (delete after 24h)
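A possible shape for the 24h expiration enhancement (a sketch only; purge_expired_jobs is not part of the current design):

```python
from datetime import datetime, timedelta

# Hypothetical JobQueue method: drop finished jobs older than max_age_hours
def purge_expired_jobs(self, max_age_hours: int = 24) -> None:
    cutoff = datetime.now() - timedelta(hours=max_age_hours)
    for job_id, job in list(self.jobs.items()):
        if job.completed_at and job.completed_at < cutoff:
            del self.jobs[job_id]
```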
Progress Reporting:
- Job object updated in real-time by worker
- Client polls GET /api/job/{job_id} for updates
- Progress includes: percent complete, current file, status message
- Same progress callback used for CLI and API
Error Handling:
- Exceptions caught and stored in job.error
- Job marked as FAILED
- Client can retrieve error details
- Partial progress preserved for debugging
Concurrency:
- Multiple clients can submit jobs (queued serially)
- Only one job executes at a time
- Queue grows if submission rate > processing rate
- Enhancement: Add queue size limit
Daemon Mode Architecture
CRITICAL: Temporal indexing works identically in standalone and daemon modes.
Standalone Mode (daemon.enabled: false):
- CLI directly instantiates TemporalIndexer
- Executes indexing in-process with direct progress callbacks
- Progress bar displayed directly in terminal
Daemon Mode (daemon.enabled: true):
- CLI delegates to daemon via _index_via_daemon() with index_commits flag
- Daemon's exposed_index_blocking() instantiates TemporalIndexer internally
- Executes indexing synchronously (blocking RPC call)
- Progress callbacks streamed back to CLI over RPyC connection
- Client displays progress bar in real-time (perfect UX parity with standalone)
- Cache automatically invalidated before and after indexing
Mode Detection:
- Automatic based on .code-indexer/config.json daemon configuration
- Zero code changes needed in TemporalIndexer (mode-agnostic design)
- Same progress callback signature in both modes
- Same SQLite/file operations in both modes
- Same git subprocess calls in both modes
Cache Invalidation (Daemon Mode Only):
- Temporal indexing adds new vectors to FilesystemVectorStore for historical blobs
- Semantic HNSW index must be rebuilt to include new vectors
- FTS index (if enabled) must include new historical content
- Already implemented in daemon/service.py::exposed_index_blocking() (lines 195-199)
- No additional cache invalidation code needed
Query Flow Architecture
Two-Phase Query:
- Semantic Search: Existing HNSW index on FilesystemVectorStore
- Temporal Filtering: SQLite filtering by time/commit
Performance Targets:
- Semantic HNSW search: ~200ms
- SQLite temporal filter: ~50ms
- Total: <300ms for typical queries
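A minimal sketch of the two-phase flow (hnsw_index.search, embed, and registry.blob_for_point are assumed names; the SQL follows the commits.db schema above):

```python
# Phase 1: semantic search over the cached HNSW index
hits = hnsw_index.search(embed(query_text), k=50)          # → [(point_id, score), ...]
blob_hashes = [registry.blob_for_point(pid) for pid, _ in hits]

# Phase 2: SQLite temporal filter over commits.db
placeholders = ",".join("?" * len(blob_hashes))
rows = conn.execute(
    f"""SELECT DISTINCT t.blob_hash, c.hash, c.date
        FROM trees t JOIN commits c ON c.hash = t.commit_hash
        WHERE t.blob_hash IN ({placeholders})
          AND c.date BETWEEN ? AND ?""",
    (*blob_hashes, start_ts, end_ts),
).fetchall()
```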
Implementation Guidelines
Critical Requirements
1. Lazy Module Loading (MANDATORY from CLAUDE.md):
   - ALL temporal modules MUST use lazy imports (see the sketch after this list)
   - Follow FTS lazy loading pattern from smart_indexer.py
   - Guarantee: cidx --help and non-temporal queries unchanged
2. Storage Optimization:
   - SQLite with compound indexes for 40K+ repos
   - Blob registry with SQLite migration path
3. Branch-Aware Indexing Strategy:
   - Default: Index current branch only (cost-effective, 91%+ coverage)
   - Opt-in: --all-branches flag for complete multi-branch indexing
   - Track branch metadata for all commits to enable future branch-aware queries
   - Optional --max-commits and --since-date for control
   - Warn users about storage/API costs before indexing all branches
4. Error Handling:
   - Graceful fallback to space-only search
   - Clear warnings and suggested actions
5. Backward Compatibility:
   - All temporal features opt-in via flags
   - Existing functionality unchanged
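A minimal sketch of the lazy-import pattern (the module path is an assumption; the point is that temporal imports happen inside the function, never at module top level):

```python
def run_temporal_index(repo_path, **options):
    """Run temporal indexing with module loading deferred to call time."""
    # Deferred import keeps `cidx --help` and non-temporal commands unaffected
    from code_indexer.services.temporal_indexer import TemporalIndexer  # assumed path
    return TemporalIndexer(repo_path).index_commits(**options)
```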
Implementation Order
- Story 1 (Feature 01): Git History Indexing with Branch Metadata (Foundation)
- Story 2 (Feature 01): Incremental Indexing with Branch Tracking (Optimization)
- Story 3 (Feature 01): Selective Branch Indexing Patterns (Advanced)
- Story 1 (Feature 02): Time-Range Filtering with Branch Context (Core Query)
- Story 2 (Feature 02): Point-in-Time Query with Branch Info (Advanced)
- Story 1 (Feature 03): Evolution Display with Branch Visualization (Visualization)
- Story 1 (Feature 04): Golden Repo Registration with Temporal (API Server)
- Story 1 (Feature 05): Temporal Query Parameters via API (API Server)
Acceptance Criteria
Functional Requirements (Both Modes)
- Temporal indexing with cidx index --index-commits (defaults to current branch)
- Multi-branch indexing with cidx index --index-commits --all-branches
- Selective branch indexing with cidx index --index-commits --branches "feature/*,bugfix/*"
- Time-range queries with --time-range showing branch context
- Point-in-time queries with --at-commit showing branch information
- Evolution display with --show-evolution including branch visualization
- Incremental indexing on re-runs preserving branch metadata
Daemon Mode Requirements
- Temporal indexing works in daemon mode when daemon.enabled: true
- CLI automatically delegates to daemon via _index_via_daemon()
- Progress callbacks stream correctly from daemon to client
- Progress bar displays identically in both modes (UX parity)
- Cache invalidation before temporal indexing (daemon mode)
- Cache automatically cleared after temporal indexing completes
- Graceful fallback to standalone mode if daemon unavailable
- All temporal flags passed correctly through delegation layer
- Watch mode integration with config (respects branch settings)
- Cost warning before indexing all branches in large repos
Performance Requirements
- Query performance <300ms on 40K+ repos (with branch filtering)
- 92%+ storage savings via blob deduplication
- SQLite indexes optimized for scale including branch queries
- Single-branch indexing completes in reasonable time (similar to current indexing)
- Cost warning displays accurate estimates (storage, API calls) for multi-branch indexing
Quality Requirements
- Graceful error handling with fallback
- Lazy loading preserves startup time
- All tests passing including E2E
Risk Mitigation
| Risk | Impact | Mitigation |
|---|---|---|
| SQLite performance at scale | High | Compound indexes, WAL mode, tuning, branch-specific indexes |
| Storage growth with all-branches | High | Default to single branch, cost warnings, blob deduplication |
| Module import overhead | High | Mandatory lazy loading pattern |
| Git operations slow | Medium | Incremental indexing, caching, single-branch default |
| Breaking changes | High | All features opt-in via flags |
| Users accidentally index all branches | Medium | Explicit flag required, cost warning with confirmation |
| Branch metadata overhead | Low | Indexed efficiently, minimal query impact |
| Daemon cache coherence | Medium | Automatic cache invalidation already implemented |
| Mode-specific bugs | Medium | Mode-agnostic design, comprehensive testing in both modes |
Dependencies
- Existing: FilesystemVectorStore, HNSW index, HighThroughputProcessor
- Existing: Daemon mode architecture (cli_daemon_delegation.py, daemon/service.py)
- New: sqlite3 (lazy), difflib (lazy), git commands
- Configuration: enable_temporal setting
- Feature 04 depends on Features 01-03 (CLI temporal implementation must exist)
- Feature 05 depends on Feature 04 (temporal index must be created during registration)
- Both API features integrate with existing server architecture (GoldenRepoManager, SemanticSearchService)
- Daemon integration: No new dependencies (uses existing RPC and cache invalidation)
Testing Strategy
Unit Tests (Mode-Agnostic)
- TemporalIndexer blob registry building
- TemporalSearchService filtering logic
- SQLite performance with mock data
- Branch metadata storage and retrieval
- Cost estimation calculations
Integration Tests (Standalone Mode)
- End-to-end temporal indexing in standalone mode
- Query filtering accuracy
- Watch mode temporal updates
- Single-branch vs all-branches behavior
- Cost warnings and confirmations
Integration Tests (Daemon Mode)
- CRITICAL: All temporal operations must work in daemon mode
- Temporal indexing delegation via _index_via_daemon()
- Progress callback streaming from daemon to client
- Cache invalidation before/after temporal indexing
- UX parity verification (progress bar display)
- Graceful fallback to standalone if daemon unavailable
- Daemon remains responsive during long temporal indexing operations
Manual Tests (Both Modes)
- Each story has specific manual test scenarios
- Performance validation on large repos (40K+ commits, 1000+ branches)
- Error handling verification
- Test in daemon mode: Enable daemon, verify identical behavior
- Test cache coherence: Query before/after temporal indexing in daemon mode
- Test progress streaming: Verify real-time progress in daemon mode
32-Mode Combination Test Matrix
CRITICAL: CIDX has 5 independent binary operational dimensions that create 2^5 = 32 possible mode combinations. All combinations are valid and must be tested.
Five Binary Dimensions
- Daemon Mode (D): Standalone (-) vs Daemon (D)
- Branch Indexing (B): Single branch (-) vs All branches (B)
- Watch Mode (W): Disabled (-) vs Enabled (W)
- FTS Index (F): Disabled (-) vs Enabled (F)
- Temporal Index (T): Disabled (-) vs Enabled (T)
Notation: Each mode represented as 5-character string (e.g., D-WFT = Daemon + Single Branch + Watch + FTS + Temporal)
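The notation maps directly to a five-character string; an illustrative encoder:

```python
def mode_string(daemon: bool, all_branches: bool, watch: bool,
                fts: bool, temporal: bool) -> str:
    """Encode the five binary dimensions as the D/B/W/F/T notation."""
    flags = [(daemon, "D"), (all_branches, "B"), (watch, "W"),
             (fts, "F"), (temporal, "T")]
    return "".join(ch if enabled else "-" for enabled, ch in flags)

assert mode_string(True, False, True, True, True) == "D-WFT"
```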
Mode Combination Table
All 32 combinations are VALID and supported:
| Tier | Example Modes | Description | Test Priority |
|---|---|---|---|
| Tier 1: Basic | -----, D----, ---FT, D--FT | Core functionality | CRITICAL |
| Tier 2: Watch | --W--, D-W--, --WFT, D-WFT | Incremental updates | HIGH |
| Tier 3: Multi-Branch | -B---, DB---, -B-FT, DB-FT | Branch deduplication | MEDIUM |
| Tier 4: Production | -BWFT, DBWFT | Full deployment modes | CRITICAL |
Complete Matrix: See .analysis/32_mode_test_matrix_20251102.md for:
- All 32 combinations enumerated
- Test strategy per tier
- Performance expectations per mode
- Common production use cases
Key Test Scenarios by Dimension
Daemon vs Standalone:
- Cache coordination (daemon only)
- Concurrent queries (daemon only)
- Query performance (daemon 10-15x faster after warmup)
Single vs All Branches:
- Blob deduplication across branches
- Branch-specific temporal queries
- Index size verification
Watch Mode:
- Automatic incremental indexing
- Git change detection
- Temporal index updates (new commits)
FTS Index:
- Exact text search functionality
- Performance vs semantic search
- Compatibility with temporal queries
Temporal Index:
- Git history indexing
- Time-range and point-in-time queries
- Code evolution display
- SQLite database integrity
Production Deployment Modes
Development Mode: --WFT (Mode 8)
- Standalone + Single Branch + Watch + FTS + Temporal
- Developer working on single repository
- Automatic updates, all query types available
Team Server Mode: D--FT (Mode 20)
- Daemon + Single Branch + FTS + Temporal
- Shared server for team queries
- Fast queries, all features, single main branch
Multi-Repo Server: DBWFT (Mode 32)
- Daemon + All Branches + Watch + FTS + Temporal
- Large-scale code search infrastructure
- Maximum functionality, automatic updates, fast queries
Test Execution Strategy
Phase 1: Smoke Tests (4 modes)
- Mode 1 (-----) - Baseline semantic search
- Mode 2 (----T) - Temporal only
- Mode 17 (D----) - Daemon only
- Mode 32 (DBWFT) - Everything enabled
Phase 2: Feature Tests (16 modes)
- All temporal-enabled modes
- Verify temporal queries work in all contexts
Phase 3: Daemon Tests (16 modes)
- Modes 17-32 (all daemon modes)
- Verify cache coordination and concurrent access
Phase 4: Complete Matrix (32 modes)
- All combinations tested
- Cross-mode consistency verified
- 100% coverage of valid combinations
Implementation
# Test pattern for mode combinations
@pytest.mark.parametrize("mode_config", [
{"daemon": False, "all_branches": False, "watch": False, "fts": False, "temporal": False},
{"daemon": False, "all_branches": False, "watch": False, "fts": False, "temporal": True},
# ... (all 32 combinations)
])
def test_mode_combination(mode_config, temp_git_repo):
"""Test specific mode combination works correctly."""
# Setup configuration
config = create_config_from_mode(mode_config)
# Initialize and index
if mode_config["daemon"]:
daemon = start_daemon(temp_git_repo, config)
index_repository(
repo_path=temp_git_repo,
all_branches=mode_config["all_branches"],
enable_fts=mode_config["fts"],
enable_temporal=mode_config["temporal"],
watch_mode=mode_config["watch"]
)
# Verify features
verify_index_features(temp_git_repo, mode_config)
verify_query_functionality(temp_git_repo, mode_config)
    verify_feature_isolation(temp_git_repo, mode_config)
Verification Criteria (Each Mode):
- Index creation succeeds for configured indexes
- Semantic queries return results
- Enabled features work correctly
- Disabled features fail gracefully
- Performance within expected ranges
- Cache coherence (daemon only)
- Incremental updates (watch only)
Test Files:
- tests/integration/test_32_mode_combinations.py - Main matrix
- tests/integration/test_daemon_mode_combinations.py - Modes 17-32
- tests/integration/test_watch_mode_combinations.py - Watch-enabled
- tests/integration/test_temporal_mode_combinations.py - Temporal-enabled
Documentation Requirements
- Update README.md with temporal search examples
- Add temporal flags to --help
- Document performance tuning for large repos
- Include troubleshooting guide
Success Metrics
- Query latency <300ms (P95)
- Storage efficiency >80% deduplication
- Zero impact on non-temporal operations
- User adoption by AI coding agents
Performance Budget Allocation
Target: <300ms Total Query Latency (P95)
| Component | Budget | Justification |
|---|---|---|
| Semantic Search (HNSW) | 50-100ms | Cached index in daemon mode, 800ms cold |
| SQLite Temporal Filtering | 10-20ms | Indexed queries on 40K+ commits |
| Removed Code Detection | 5-10ms | Hash lookups in blob registry |
| Result Formatting | 5-10ms | Rich output rendering |
| Network/IPC Overhead | 10-20ms | Daemon mode communication |
| Total | 80-160ms | Leaves 140-220ms buffer |
Indexing Performance Targets:
| Operation | Target | Scale |
|---|---|---|
| Blob deduplication lookup | <1ms per blob | SQLite indexed query |
| Git blob read | 5-10ms per blob | git cat-file subprocess |
| Embedding generation | 50-100ms per batch | VoyageAI API call (50 chunks) |
| SQLite write (commit) | <5ms | EXCLUSIVE transaction |
| Full temporal index (40K commits, 12K unique blobs) | 4-7 minutes | End-to-end with deduplication |
SQLite Performance Expectations:
Realistic (Not "Microseconds"):
- Blob registry lookup: <1ms (indexed query, 40K blobs)
- Commit metadata insert: <5ms (EXCLUSIVE transaction)
- Temporal query filtering: 10-20ms (indexed date/hash queries)
- Branch metadata lookup: <2ms (indexed by commit_hash)
Configuration for Performance:
PRAGMA journal_mode=WAL; -- Concurrent readers
PRAGMA busy_timeout=5000; -- 5s lock wait
PRAGMA synchronous=NORMAL; -- Balance safety/speed
PRAGMA cache_size=8192; -- 8MB cache
PRAGMA temp_store=MEMORY; -- In-memory temp tables
Validation: Performance targets based on:
- Measured HNSW query times in existing codebase
- SQLite benchmarks for similar dataset sizes
- VoyageAI API documented latencies
- Real-world testing on Evolution repository (89K commits)
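For reference, a sketch of applying the PRAGMAs above when opening commits.db (standard sqlite3 usage; PRAGMAs are per-connection settings):

```python
import sqlite3

conn = sqlite3.connect(".code-indexer/index/temporal/commits.db")
for pragma in ("journal_mode=WAL", "busy_timeout=5000",
               "synchronous=NORMAL", "cache_size=8192", "temp_store=MEMORY"):
    conn.execute(f"PRAGMA {pragma}")  # Apply performance settings to this connection
```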
Notes
Conversation Context:
- User emphasized storage efficiency for 40K+ repos
- Configuration-driven watch mode integration
- Graceful degradation critical
- MANDATORY lazy loading from CLAUDE.md
Branch Strategy Analysis (Evolution Repository):
- Analyzed real-world large enterprise codebase (Evolution - 1,135 branches, 89K commits)
- Single branch (development): 81,733 commits = 91.6% coverage
- All branches: 89,234 commits = 100% coverage but 85.5% storage increase
- Deduplication effectiveness: 92.4% (most code shared between branches)
- Recommendation: Default to single branch, opt-in for all branches
- Cost transparency: Warn users before expensive operations
- See .analysis/temporal_indexing_branch_analysis.md for complete analysis
Key Design Decisions:
- Default = Current Branch Only: Cost-effective, covers 90%+ of real-world use cases
- Explicit Opt-in for All Branches: Users must consciously choose expensive operation
- Track Branch Metadata Always: Even single-branch indexing records which branch
- Cost Warnings: Display storage/API estimates before multi-branch indexing
- Future-Proof Schema: Branch metadata enables future branch-aware queries
Quality Assurance & Validation Reports
Epic Status: ✅ GO - Ready for Implementation (Risk: <10%)
This Epic underwent comprehensive validation through Codex Architect pressure testing. All critical issues identified have been resolved. The following reports document the validation process and resolution:
Critical Issues Resolution (Codex Pressure Test)
- All Critical Issues Complete - Final status: GO, <10% risk
- Codex Pressure Test Response - Original pressure test findings (NO-GO verdict)
Individual Issue Reports
- Issue #1: Architectural Audit - Architecture verification (verified correct)
- Issue #2: Component Reuse - Realistic component reuse (60-65%)
- Issue #3: Progress Callbacks - RPyC-safe progress specification
- Issues #1-4 Combined - Memory management strategy
- Issue #5: Git Performance - Benchmarked on Evolution repo (89K commits)
Testing Guidance
- E2E Test Exclusions - fast-automation.sh exclusion strategy
Risk Evolution:
- Initial (NO-GO): 75% failure risk, 5 critical issues
- Final (GO): <10% failure risk, all issues resolved
- Implementation Readiness: 95%
Key Validations Completed:
- ✅ Git performance benchmarked on real 89K commit repository
- ✅ Realistic component reuse expectations (60-65%)
- ✅ RPyC-safe progress callback specification
- ✅ Memory management strategy for 4GB systems
- ✅ Architecture correctness verified