Skip to content

feat(queries): expose fileHash in where and query JSON output#257

Merged
carlos-alm merged 7 commits intomainfrom
feat/expose-file-hash
Mar 3, 2026
Merged

feat(queries): expose fileHash in where and query JSON output#257
carlos-alm merged 7 commits intomainfrom
feat/expose-file-hash

Conversation

@carlos-alm
Copy link
Contributor

@carlos-alm carlos-alm commented Mar 3, 2026

Summary

  • Add getFileHash helper in src/queries.js that looks up file content hashes from the file_hashes table
  • Add fileHash field to queryNameData, whereSymbolImpl (symbol mode), and whereFileImpl (file mode) return objects
  • Enables downstream consumers (e.g. code-praxis) to detect when a rule's target file has changed since rule creation (staleness detection)

Test plan

  • Added file_hashes rows to integration test fixture DB
  • Assert fileHash value in queryNameData test
  • Assert fileHash value in whereData symbol mode test
  • Assert fileHash value in whereData file mode test
  • All 54 integration tests pass
  • Verified fileHash appears in live CLI output: where --json, where -f --json, query --json

Write symbol-level change events to .codegraph/change-events.ndjson
during watch mode. Each line records added/removed/modified symbols
with node counts and edge data, enabling external tools to detect
rule staleness without polling. File is size-capped at 1 MB with
keep-last-half rotation.

Impact: 8 functions changed, 4 affected
Impact: 1 functions changed, 1 affected
@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@claude
Copy link

claude bot commented Mar 3, 2026

Claude finished @carlos-alm's task —— View job


PR Review Complete

Reviewing PR #257: feat(watcher): structured NDJSON change journal

Todo List:

  • Read repository context and guidelines
  • Analyze changed files and implementation
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity Assessment
  • Phase 4: Code Quality Review
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR adds structured change event logging to complement the existing simple changes.journal file. The existing journal only tracks file paths that changed/were deleted, but this new feature provides symbol-level granularity with before/after snapshots of function definitions, classes, etc.

Why This Approach: The NDJSON format enables streaming consumption by external tools, and the symbol diffing provides semantic change tracking beyond file-level changes. This is valuable for impact analysis and change visualization tools.

Risk Assessment: LOW RISK - This is purely additive functionality that runs alongside existing journal functionality without replacing it.

Configuration Change Detection

No problematic configuration changes detected

  • No linting rules modified
  • No test configurations altered
  • No build settings changed
  • No TypeScript strictness reduced

Test Integrity Assessment

Strong test coverage

  • 21 comprehensive unit tests covering all new functions
  • Tests include edge cases (empty inputs, file rotation, error handling)
  • Existing journal tests confirmed unaffected (17 tests still passing)
  • Full test suite remains green (1097 tests, 0 failures)

Code Quality Review

Strengths

  1. Clean API Design (src/change-journal.js)

    • Well-separated concerns with focused functions
    • Clear function signatures with proper JSDoc
    • Consistent error handling patterns
  2. Symbol Diffing Logic (diffSymbols function, lines 24-55)

    • Clever use of name\0kind composite keys for symbol identity
    • Handles all change types: added, removed, modified (line changes)
    • Proper handling of same-name, different-kind symbols
  3. Integration with Watcher (src/watcher.js lines 29, 146, 244-255)

    • Non-invasive integration - captures before/after snapshots cleanly
    • Maintains existing journal functionality alongside new change events
    • Proper error isolation - change event failures don't break watch mode
  4. File Rotation Strategy (rotateIfNeeded, lines 103-122)

    • Size-capped at 1MB with keep-last-half rotation
    • Handles edge cases (file doesn't exist, single huge line)
    • Non-fatal failure handling with proper logging

Minor Concerns

  1. File I/O Error Handling (appendChangeEvents, lines 81-96)

    • Silent failure on directory creation or file write could mask disk space issues
    • Consider logging warnings for better observability
  2. Rotation Logic Edge Case (lines 115-118)

    • If indexOf('\n', mid) returns -1, the function returns early without rotation
    • A very long single line could grow unbounded, though this is unlikely in practice
  3. Performance Consideration

    • Symbol diffing on every file change could add latency in high-churn scenarios
    • However, the debouncing in watcher (300ms) mitigates this concern

Security Assessment

No security concerns identified

  • File paths are properly normalized and validated
  • No command injection vectors
  • JSON serialization uses standard library functions
  • Directory creation uses safe recursive flag

Technical Implementation

The integration is well-architected:

  • Reuses existing symbol extraction from parseFileIncremental
  • Leverages the same database queries for before/after snapshots
  • Maintains separation between simple journal and rich change events
  • Uses proper NDJSON format for streaming consumption

The test coverage is exceptional - the 21 unit tests cover:

  • Symbol diffing edge cases and complex scenarios
  • Event building and serialization
  • File rotation mechanics
  • Full lifecycle integration tests

Final Recommendation

This is a solid, well-tested feature addition that enhances codegraph's change tracking capabilities without breaking existing functionality. The implementation follows project conventions and handles errors gracefully.

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation with comprehensive tests, solving a real need for structured change tracking. The additive nature and non-fatal error handling make this a safe enhancement.

Suggested follow-up: Consider adding debug logging for file I/O operations to improve observability in production environments.

Add fileHash field to queryNameData, whereSymbolImpl, and whereFileImpl
return objects by looking up the file_hashes table. This lets consumers
(e.g. code-praxis) detect when a rule's target file has changed since
the rule was created, enabling staleness detection.

Impact: 4 functions changed, 16 affected
@claude
Copy link

claude bot commented Mar 3, 2026

Claude finished @carlos-alm's task —— View job


Claude finished @carlos-alm's task —— View job


PR Review Complete

Reviewing PR #257: feat(watcher): structured NDJSON change journal

Todo List:

  • Read repository context and guidelines
  • Analyze changed files and implementation
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity Assessment
  • Phase 4: Code Quality Review
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR adds structured change event logging to complement the existing simple changes.journal file. The existing journal only tracks file paths that changed/were deleted, but this new feature provides symbol-level granularity with before/after snapshots of function definitions, classes, etc.

Why This Approach: The NDJSON format enables streaming consumption by external tools, and the symbol diffing provides semantic change tracking beyond file-level changes. This is valuable for impact analysis and change visualization tools.

Risk Assessment: LOW RISK - This is purely additive functionality that runs alongside existing journal functionality without replacing it.

Configuration Change Detection

No problematic configuration changes detected

  • No linting rules modified
  • No test configurations altered
  • No build settings changed
  • No TypeScript strictness reduced

Test Integrity Assessment

Strong test coverage

  • 21 comprehensive unit tests covering all new functions
  • Tests include edge cases (empty inputs, file rotation, error handling)
  • Existing journal tests confirmed unaffected (17 tests still passing)
  • Full test suite remains green (1097 tests, 0 failures)

Code Quality Review

Strengths

  1. Clean API Design (src/change-journal.js)

    • Well-separated concerns with focused functions
    • Clear function signatures with proper JSDoc
    • Consistent error handling patterns
  2. Symbol Diffing Logic (diffSymbols function, lines 24-55)

    • Clever use of name\0kind composite keys for symbol identity
    • Handles all change types: added, removed, modified (line changes)
    • Proper handling of same-name, different-kind symbols
  3. Integration with Watcher (src/watcher.js lines 29, 146, 244-255)

    • Non-invasive integration - captures before/after snapshots cleanly
    • Maintains existing journal functionality alongside new change events
    • Proper error isolation - change event failures don't break watch mode
  4. File Rotation Strategy (rotateIfNeeded, lines 103-122)

    • Size-capped at 1MB with keep-last-half rotation
    • Handles edge cases (file doesn't exist, single huge line)
    • Non-fatal failure handling with proper logging

Minor Concerns

  1. File I/O Error Handling (appendChangeEvents, lines 81-96)

    • Silent failure on directory creation or file write could mask disk space issues
    • Consider logging warnings for better observability
  2. Rotation Logic Edge Case (lines 115-118)

    • If indexOf('\n', mid) returns -1, the function returns early without rotation
    • A very long single line could grow unbounded, though this is unlikely in practice
  3. Performance Consideration

    • Symbol diffing on every file change could add latency in high-churn scenarios
    • However, the debouncing in watcher (300ms) mitigates this concern

Security Assessment

No security concerns identified

  • File paths are properly normalized and validated
  • No command injection vectors
  • JSON serialization uses standard library functions
  • Directory creation uses safe recursive flag

Technical Implementation

The integration is well-architected:

  • Reuses existing symbol extraction from parseFileIncremental
  • Leverages the same database queries for before/after snapshots
  • Maintains separation between simple journal and rich change events
  • Uses proper NDJSON format for streaming consumption

The test coverage is exceptional - the 21 unit tests cover:

  • Symbol diffing edge cases and complex scenarios
  • Event building and serialization
  • File rotation mechanics
  • Full lifecycle integration tests

Final Recommendation

This is a solid, well-tested feature addition that enhances codegraph's change tracking capabilities without breaking existing functionality. The implementation follows project conventions and handles errors gracefully.

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation with comprehensive tests, solving a real need for structured change tracking. The additive nature and non-fatal error handling make this a safe enhancement.

Suggested follow-up: Consider adding debug logging for file I/O operations to improve observability in production environments.

@carlos-alm carlos-alm changed the title feat(watcher): structured NDJSON change journal feat(queries): expose fileHash in where and query JSON output Mar 3, 2026
@carlos-alm
Copy link
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@carlos-alm
Copy link
Contributor Author

recheck

@carlos-alm
Copy link
Contributor Author

I have read the CLA Document and I hereby sign the CLA

Address review feedback: debug log on successful append and rotation,
warn when a single oversized line prevents rotation.

Impact: 2 functions changed, 2 affected
Impact: 1 functions changed, 1 affected
@greptile-apps
Copy link

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR adds structured NDJSON change journaling to watch mode, enabling symbol-level tracking of file modifications.

  • Created src/change-journal.js with diffSymbols() for comparing symbol snapshots, buildChangeEvent() for event construction, and appendChangeEvents() with automatic rotation at 1MB
  • Enhanced src/watcher.js to capture before/after symbol states during incremental updates and write change events to .codegraph/change-events.ndjson
  • Added 21 comprehensive unit tests covering all functions, edge cases, and full lifecycle scenarios
  • All writes are non-fatal with graceful error handling
  • Minor byte/character precision issue in rotation logic (see inline comment)

Confidence Score: 4/5

  • safe to merge with minimal risk
  • well-tested feature addition with clean integration, non-fatal error handling, and only minor style suggestions; no critical bugs or security issues
  • no files require special attention

Important Files Changed

Filename Overview
src/change-journal.js new module for symbol-level change tracking with NDJSON output and size-capped rotation; minor byte/character mismatch in rotation logic
src/watcher.js clean integration of change journal into watch mode, capturing before/after symbol snapshots with proper error handling
tests/unit/change-journal.test.js comprehensive test coverage with 21 tests covering all functions, edge cases, and full lifecycle scenarios

Sequence Diagram

sequenceDiagram
    participant Watcher as watcher.js
    participant UpdateFile as updateFile()
    participant DB as SQLite
    participant Journal as change-journal.js
    participant FS as File System

    Watcher->>UpdateFile: file change detected
    UpdateFile->>DB: listSymbols.all(relPath)
    DB-->>UpdateFile: oldSymbols[]
    
    UpdateFile->>DB: deleteEdgesForFile, deleteNodes
    UpdateFile->>DB: insert new nodes & edges
    
    UpdateFile->>DB: listSymbols.all(relPath)
    DB-->>UpdateFile: newSymbols[]
    
    UpdateFile->>Journal: diffSymbols(oldSymbols, newSymbols)
    Journal-->>UpdateFile: symbolDiff {added, removed, modified}
    
    UpdateFile-->>Watcher: return result with symbolDiff
    
    Watcher->>Journal: buildChangeEvent(file, event, symbolDiff, counts)
    Journal-->>Watcher: changeEvent object
    
    Watcher->>Journal: appendChangeEvents(rootDir, [changeEvent])
    Journal->>FS: appendFileSync(change-events.ndjson)
    Journal->>Journal: rotateIfNeeded(filePath, 1MB)
    
    alt file > 1MB
        Journal->>FS: writeFileSync (keep last ~half)
    end
Loading

Last reviewed commit: 5e49ba6

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +111 to +118

if (stat.size <= maxBytes) return;

try {
const content = fs.readFileSync(filePath, 'utf-8');
const mid = Math.floor(content.length / 2);
const newlineIdx = content.indexOf('\n', mid);
if (newlineIdx === -1) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byte vs character mismatch in rotation logic

stat.size uses bytes but content.length counts characters. For multi-byte UTF-8 (emoji, non-ASCII), the midpoint calculation may not align with the byte-based threshold, keeping more or less than half the file.

Consider using Buffer.byteLength(content) to calculate the midpoint in bytes for consistency with the threshold check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8b4bf03 — rotation now reads as a raw Buffer and uses buf.indexOf(0x0a, mid) to find the newline at the byte-level midpoint, keeping the threshold check and midpoint calculation both in bytes.

@carlos-alm
Copy link
Contributor Author

Addressed both minor concerns from the automated review in commit a87c5da + 5e49ba6:

  1. File I/O observabilityappendChangeEvents already called warn() on failure, but lacked positive-path logging. Added debug() after successful appends and rotations so verbose mode shows the event flow.

  2. Rotation edge case — the no-newline early return now emits a warn() instead of silently returning, making it visible if a single oversized line ever prevents rotation.

@greptileai

@claude
Copy link

claude bot commented Mar 3, 2026

Claude finished @carlos-alm's task —— View job


****


PR Review Complete

Reviewing PR #257: feat(queries): expose fileHash in where and query JSON output

Todo List:

  • Read repository context and guidelines
  • Analyze changed files and implementation
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity Assessment
  • Phase 4: Code Quality Review
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR adds fileHash field exposure to query and where commands to enable staleness detection for downstream tools like code-praxis. The existing file_hashes table already tracks content hashes during incremental builds, but these weren't exposed in JSON output.

Why This Approach: Clean and minimal - adds a single helper function getFileHash() that queries the existing file_hashes table and exposes it in 3 key query functions. This leverages existing infrastructure without adding complexity.

Risk Assessment: VERY LOW RISK - This is purely additive functionality that doesn't modify existing behavior, just adds an optional field to JSON output.

Configuration Change Detection

No configuration changes detected

  • No linting rules modified
  • No test configurations altered
  • No build settings changed
  • biome.json unchanged - still using recommended rules with no relaxations

Test Integrity Assessment

Strong test coverage and integrity

  • Integration tests properly updated with file_hashes table data (lines 106-114)
  • 3 new assertions added for fileHash field in queryNameData, whereData symbol mode, and whereData file mode
  • Test data uses deterministic hash values: hash_auth_js, hash_middleware_js, etc.
  • No test expectations weakened or assertions removed
  • All existing tests continue to pass

Code Quality Review

Strengths

  1. Clean Implementation (src/queries.js:2736-2739)

    function getFileHash(db, file) {
      const row = db.prepare('SELECT hash FROM file_hashes WHERE file = ?').get(file);
      return row ? row.hash : null;
    }
    • Simple, focused helper function
    • Handles missing entries gracefully (returns null)
    • Uses prepared statements for SQL safety
  2. Consistent Integration

    • Added to all 3 relevant functions: queryNameData (line 237), whereSymbolImpl (line 2772), whereFileImpl (line 2823)
    • Follows existing code patterns and naming conventions
    • Non-breaking change - new field is optional
  3. Proper Test Coverage

    • Test fixture properly sets up file_hashes table data
    • Assertions verify the exact expected hash values
    • Covers all 3 code paths that expose the field

Minor Considerations

  1. Performance: The getFileHash() function makes a separate DB query for each file. For bulk operations with many files, this could result in N+1 queries. However:

    • This matches the existing query pattern in the codebase
    • The functions being modified already make multiple queries per result
    • Optimization would require significant architectural changes beyond scope
  2. Null Handling: Returns null when file not found in file_hashes table, which is appropriate since:

    • It's consistent with other optional fields
    • File hashes are only populated during incremental builds
    • JSON consumers can easily check for null

Security Assessment

No security concerns

  • Uses prepared statements to prevent SQL injection
  • Only reads from the database, no writes
  • No external input validation needed (file paths already validated upstream)
  • No credential or sensitive data exposure

Final Recommendation

This is a well-executed, minimal feature addition that solves a real need for downstream tooling. The implementation is clean, well-tested, and follows established patterns in the codebase.

Recent improvements (commits a87c5da, 5e49ba6): Author proactively addressed the automated review feedback by adding debug/warn logging for better observability and fixing biome formatting issues.

  • Rating: ⭐⭐⭐⭐ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation with comprehensive test coverage. The additive nature ensures no breaking changes, and the feature enables valuable downstream functionality for staleness detection. Code follows project conventions and handles edge cases appropriately.

@carlos-alm
Copy link
Contributor Author

recheck

1 similar comment
@carlos-alm
Copy link
Contributor Author

recheck

stat.size returns bytes but String.length counts UTF-16 characters.
Read as Buffer and use buf.indexOf(0x0a) to find the newline at the
byte-level midpoint, ensuring consistent behavior with multi-byte UTF-8.

Impact: 1 functions changed, 1 affected
@carlos-alm
Copy link
Contributor Author

@greptileai

@claude
Copy link

claude bot commented Mar 3, 2026

Claude finished @carlos-alm's task —— View job


Claude is reviewing this PR

PR Review: feat(queries): expose fileHash in where and query JSON output

Todo List:

  • Read repository context and guidelines
  • Analyze changed files and implementation
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity Assessment
  • Phase 4: Code Quality Review
  • Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: This PR adds fileHash field exposure to query and where commands to enable staleness detection for downstream tools like code-praxis. The existing file_hashes table already tracks content hashes during incremental builds, but these weren't exposed in JSON output.

Why This Approach: Clean and minimal - adds a single helper function getFileHash() that queries the existing file_hashes table and exposes it in 3 key query functions. This leverages existing infrastructure without adding complexity.

Risk Assessment: VERY LOW RISK - This is purely additive functionality that doesn't modify existing behavior, just adds an optional field to JSON output.

Configuration Change Detection

No configuration changes detected

  • No linting rules modified
  • No test configurations altered
  • No build settings changed
  • biome.json unchanged - still using recommended rules with no relaxations

Test Integrity Assessment

Strong test coverage and integrity

  • Integration tests properly updated with file_hashes table data (lines 106-114)
  • 3 new assertions added for fileHash field in queryNameData, whereData symbol mode, and whereData file mode
  • Test data uses deterministic hash values: hash_auth_js, hash_middleware_js, etc.
  • No test expectations weakened or assertions removed
  • All existing tests continue to pass

Code Quality Review

Strengths

  1. Clean Implementation (src/queries.js:2736-2739)

    function getFileHash(db, file) {
      const row = db.prepare('SELECT hash FROM file_hashes WHERE file = ?').get(file);
      return row ? row.hash : null;
    }
    • Simple, focused helper function
    • Handles missing entries gracefully (returns null)
    • Uses prepared statements for SQL safety
  2. Consistent Integration

    • Added to all 3 relevant functions: queryNameData (line 237), whereSymbolImpl (line 2772), whereFileImpl (line 2823)
    • Follows existing code patterns and naming conventions
    • Non-breaking change - new field is optional
  3. Proper Test Coverage

    • Test fixture properly sets up file_hashes table data
    • Assertions verify the exact expected hash values
    • Covers all 3 code paths that expose the field

Minor Considerations

  1. Performance: The getFileHash() function makes a separate DB query for each file. For bulk operations with many files, this could result in N+1 queries. However:

    • This matches the existing query pattern in the codebase
    • The functions being modified already make multiple queries per result
    • Optimization would require significant architectural changes beyond scope
  2. Null Handling: Returns null when file not found in file_hashes table, which is appropriate since:

    • It's consistent with other optional fields
    • File hashes are only populated during incremental builds
    • JSON consumers can easily check for null

Security Assessment

No security concerns

  • Uses prepared statements to prevent SQL injection
  • Only reads from the database, no writes
  • No external input validation needed (file paths already validated upstream)
  • No credential or sensitive data exposure

Final Recommendation

This is a well-executed, minimal feature addition that solves a real need for downstream tooling. The implementation is clean, well-tested, and follows established patterns in the codebase.

Recent improvements (commits a87c5da, 5e49ba6): Author proactively addressed the automated review feedback by adding debug/warn logging for better observability and fixing biome formatting issues.

  • Rating: ⭐⭐⭐⭐ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation with comprehensive test coverage. The additive nature ensures no breaking changes, and the feature enables valuable downstream functionality for staleness detection. Code follows project conventions and handles edge cases appropriately.

@carlos-alm carlos-alm merged commit e4aa4f7 into main Mar 3, 2026
20 checks passed
@carlos-alm carlos-alm deleted the feat/expose-file-hash branch March 3, 2026 00:21
@github-actions github-actions bot locked and limited conversation to collaborators Mar 3, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant