
feat: add dataflow analysis (flows_to, returns, mutates)#254

Merged
carlos-alm merged 2 commits into main from feat/dataflow-analysis
Mar 3, 2026

Conversation

@carlos-alm
Contributor

Summary

  • Adds dataflow analysis to track how data moves through functions with three new edge types: flows_to (parameter/variable passed as argument), returns (call return value captured), and mutates (parameter-derived value mutated in-place)
  • Opt-in via build --dataflow flag — JS/TS only for MVP, extraction runs as a second AST pass after complexity analysis
  • New dataflow <name> CLI command with --path <target> (BFS data flow path) and --impact (return-value-dependent blast radius) modes
  • Schema migration v10 adds separate dataflow table with confidence scoring and expression tracking

Changes

  • src/dataflow.js: new extractDataflow() AST walker with scope tracking, buildDataflowEdges(), query functions (dataflowData, dataflowPathData, dataflowImpactData), CLI formatters
  • src/db.js: migration v10 adds the dataflow table with source_id, target_id, kind, param_index, expression, line, confidence, plus indexes
  • src/builder.js: --dataflow opt-in phase after complexity, incremental cleanup for the dataflow table, full-build cascade
  • src/cli.js: --dataflow flag on build, new dataflow command with all standard options
  • src/mcp.js: dataflow tool in BASE_TOOLS with edges/path/impact modes
  • src/batch.js: dataflow added to BATCH_COMMANDS
  • src/index.js: programmatic API exports
  • tests/parsers/dataflow-javascript.test.js: 19 unit tests for extractDataflow()
  • tests/integration/dataflow.test.js: 20 integration tests for query functions
  • tests/unit/mcp.test.js: updated tool count
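For orientation, the v10 migration might look roughly like this sketch. The column names come from the change list above; the SQL itself (CHECK clause, index names, better-sqlite3-style string constant) is an assumption, not the actual migration.

```javascript
// Sketch of migration v10 (column list from the PR description; the
// constraints and index names are illustrative assumptions).
const MIGRATION_V10 = `
  CREATE TABLE IF NOT EXISTS dataflow (
    source_id   INTEGER NOT NULL,
    target_id   INTEGER NOT NULL,
    kind        TEXT NOT NULL CHECK (kind IN ('flows_to', 'returns', 'mutates')),
    param_index INTEGER,
    expression  TEXT,
    line        INTEGER,
    confidence  REAL
  );
  CREATE INDEX IF NOT EXISTS idx_dataflow_source ON dataflow(source_id);
  CREATE INDEX IF NOT EXISTS idx_dataflow_target ON dataflow(target_id);
  CREATE INDEX IF NOT EXISTS idx_dataflow_kind   ON dataflow(kind);
`;
```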

Dogfood results

$ codegraph build . --dataflow
Dataflow: 1439 edges inserted
Graph built: 955 nodes, 1588 edges

$ codegraph dataflow buildGraph -T
  Data flows TO:
    → openDb (src/db.js:343) arg[0] [90%]
    → initSchema (src/db.js:344) arg[0] [90%]
    → loadConfig (src/config.js:346) arg[0]
    ...

$ codegraph dataflow openDb --path initSchema -T
  Data flow path: openDb → initSchema  (2 hops)
    ● openDb (src/db.js:189)
    ├─ buildGraph (src/builder.js:341) [returns]
    │ ├─ initSchema (src/db.js:245) [flows_to]

Test plan

  • 19 parser unit tests pass (parameter extraction, returns, assignments, arg flows, mutations, nested scopes)
  • 20 integration tests pass (dataflowData, dataflowPathData, dataflowImpactData, empty table, pagination, --no-tests)
  • Full test suite: 1102 passed, 0 failed
  • Lint clean
  • Dogfood: build --dataflow on codegraph itself produces 1439 edges
  • CLI commands: dataflow, --path, --impact, -j all work

Track how data moves through functions with three new edge types:
- flows_to: parameter/variable flows into another function as argument
- returns: call return value is captured by the caller
- mutates: parameter-derived value is mutated in-place

Opt-in via `build --dataflow` (JS/TS only for MVP). Adds schema
migration v10 (dataflow table), extractDataflow() AST walker with
scope tracking and confidence scoring, query functions (dataflowData,
dataflowPathData, dataflowImpactData), CLI command with --path and
--impact modes, MCP tool, batch support, and programmatic API exports.

Impact: 29 functions changed, 33 affected
@claude

claude bot commented Mar 3, 2026

Claude finished @carlos-alm's task.


PR Review: Dataflow Analysis Feature

Reviewing PR #254 for dataflow analysis functionality.

Todo List

  • Phase 1: Root Cause Analysis - understand the problem being solved
  • Phase 2: Configuration Change Detection - check for risky config changes
  • Phase 3: Test Integrity - verify test quality and coverage
  • Phase 4: Code Quality - review implementation for bugs/security/performance
  • Final Assessment and Rating

Phase 1: Root Cause Analysis ✅
This PR adds legitimate new functionality - dataflow analysis to track how data moves through functions via three edge types: flows_to, returns, and mutates. This addresses a real need for understanding data dependencies in large codebases. The opt-in nature (--dataflow flag) is appropriate for MVP.

Phase 2: Configuration Change Detection ✅
No configuration relaxations detected. Changes are purely additive:

  • New CLI flag --dataflow
  • New database table dataflow with proper migration (v10)
  • No lint rules disabled or test thresholds lowered
  • No build configuration modifications

Phase 3: Test Integrity ✅
Strong test coverage with proper quality:

  • 19 parser unit tests covering parameter extraction, returns, assignments, argument flows, mutations, nested scopes, destructuring, arrow functions
  • 20 integration tests with hand-crafted topology testing dataflow queries, path finding, impact analysis, edge cases
  • Tests use proper assertions (expect().toEqual(), expect().toBe(), expect().toHaveLength())
  • No weakened assertions or conditional test logic
  • Tests verify both positive and negative cases (found/not found, valid/invalid inputs)

Phase 4: Code Quality ✅

Security: Properly secured against SQL injection - all queries use prepared statements with parameterized inputs. File operations use safe path joining and proper error handling.

Performance:

  • Incremental cleanup of dataflow table when files change
  • Proper database indexing on source_id, target_id, kind
  • Efficient BFS algorithms with visit tracking to prevent cycles
  • WASM parser fallback only when needed
  • Transaction-wrapped bulk inserts
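The BFS-with-visit-tracking approach noted above can be sketched as follows. The edge shape mirrors the schema columns; the function and variable names are illustrative, not taken from src/dataflow.js.

```javascript
// Sketch: BFS over dataflow edges with a visited set to prevent cycles.
// Returns the shortest path of node ids from source to target, or null.
function findPath(edges, source, target) {
  // Build an adjacency list keyed by source_id.
  const adj = new Map();
  for (const { source_id, target_id } of edges) {
    if (!adj.has(source_id)) adj.set(source_id, []);
    adj.get(source_id).push(target_id);
  }
  const visited = new Set([source]);
  const queue = [[source]]; // each entry is a path of node ids
  while (queue.length) {
    const path = queue.shift();
    const last = path[path.length - 1];
    if (last === target) return path;
    for (const next of adj.get(last) ?? []) {
      if (visited.has(next)) continue; // cycle guard
      visited.add(next);
      queue.push([...path, next]);
    }
  }
  return null; // no dataflow path exists
}
```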

Architecture:

  • Clean separation: AST extraction → edge building → query functions → display formatting
  • Proper scope tracking with stack-based approach
  • Confidence scoring system for data flow accuracy
  • Error handling with graceful degradation (continues processing on individual file failures)

Code Quality:

  • Well-documented with clear function signatures
  • Appropriate constants (MUTATING_METHODS, DATAFLOW_LANG_IDS)
  • Good error handling throughout
  • Follows existing codebase patterns (pagination, CLI option handling)
  • Tree-sitter AST access is safe (null checks, type validation)

Potential Issues:

  • None significant found. The implementation is solid.

Root Cause Analysis

Problem Being Solved: Adding dataflow analysis to track data movement through functions with flows_to, returns, and mutates edges - a legitimate enhancement for understanding code dependencies.

Why This Approach: Opt-in via --dataflow flag is appropriate for MVP. Uses existing SQLite infrastructure with proper schema migration. JS/TS-only initially is reasonable scope.

Risk Assessment: Low risk. No configuration changes, strong test coverage, proper security practices.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation with comprehensive testing, proper security practices, and clean architecture. The feature addresses a real need with appropriate MVP scope. Author has done thorough dogfooding verification. Only minor consideration is the large file size (1200+ lines) but it's well-organized and documented.

@github-actions
Contributor

github-actions bot commented Mar 3, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@greptile-apps

greptile-apps bot commented Mar 3, 2026

Greptile Summary

Adds dataflow analysis to track how data moves through functions with three new edge types: flows_to (parameter/variable passed as argument), returns (return value captured), and mutates (parameter mutation). Opt-in via --dataflow flag, JS/TS only for MVP.

Key changes:

  • New src/dataflow.js with AST walker, scope tracking, BFS path finding, and impact analysis
  • Schema migration v10 adds dataflow table with foreign keys and indexes
  • Integrated into build pipeline as optional second pass after complexity analysis
  • New dataflow CLI command with --path and --impact modes
  • MCP tool with three modes (edges/path/impact) and batch command support
  • 39 tests (19 parser + 20 integration) with 100% pass rate

Implementation quality:

  • Clean separation: extraction → edge building → querying
  • Proper incremental build support with cleanup for changed files
  • Confidence scoring for flow tracking (0.5-1.0 based on binding type)
  • Comprehensive test coverage including destructuring, defaults, rest params, nested scopes
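The binding-based confidence scoring mentioned above could be structured like this sketch. The 0.5 to 1.0 range comes from the review; the specific tiers and binding-type names are assumptions for illustration only.

```javascript
// Sketch of confidence scoring by binding type. The tier values and
// type names are invented examples within the 0.5-1.0 range the PR uses.
function flowConfidence(binding) {
  switch (binding.type) {
    case "parameter":    return 1.0; // a parameter passed directly as an argument
    case "const_return": return 0.9; // const x = foo(); x later used as an argument
    case "reassigned":   return 0.7; // let/var binding rebound along the way
    case "destructured": return 0.6; // property pulled out of a tracked value
    default:             return 0.5; // derived through an unknown expression
  }
}
```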

Known limitations (acceptable for MVP):

  • Spread arguments (foo(...args)) not tracked
  • Optional chaining (foo?.()) not handled
  • Subscript expressions (arr[0]) not tracked as argument flows
  • Non-declaration assignments (x = foo() without const/let/var) don't create returns edges
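For concreteness, the untracked constructs listed above look like this in source code. The stub definitions and names are invented so the snippet is self-contained; only the commented constructs are what the review refers to.

```javascript
// Illustrative examples of the constructs the MVP does not track.
// Stubs below exist only so the snippet runs.
const foo = (v) => v;
const maybeFn = null;
const args = [1, 2];
const arr = [42];
let x;

foo(...args);  // spread argument: the underlying `args` binding is not tracked
maybeFn?.();   // optional call: callee name through ?. is not resolved
foo(arr[0]);   // subscript expression: element access is not an argument flow
x = foo(7);    // bare assignment (no const/let/var): no returns edge created
```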

Confidence Score: 4/5

  • Safe to merge - well-tested feature addition with clean integration and comprehensive test coverage
  • High confidence based on: 39 passing tests with no failures, clean schema migration, proper incremental build support, opt-in design minimizes risk. Minor edge cases in argument tracking (spread, optional chaining, subscripts) are acceptable limitations for MVP and don't affect core functionality.
  • No files require special attention - all changes follow existing patterns and are well-integrated

Important Files Changed

  • src/dataflow.js: new dataflow analysis engine with AST walker, scope tracking, and query functions; well structured with comprehensive edge-case handling
  • src/db.js: clean migration v10 adds the dataflow table with proper foreign keys and indexes
  • src/builder.js: proper integration of the dataflow phase with incremental build cleanup; opt-in via the --dataflow flag
  • src/cli.js: new dataflow command with path/impact modes and standard filtering options
  • src/mcp.js: MCP tool integration with three modes (edges, path, impact) and proper parameter handling
  • tests/parsers/dataflow-javascript.test.js: 19 unit tests covering parameter extraction, returns, assignments, arg flows, mutations, and nested scopes
  • tests/integration/dataflow.test.js: 20 integration tests for query functions with a hand-crafted dataflow topology

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[build --dataflow] --> B[Parse files with tree-sitter]
    B --> C[Extract symbols & complexity]
    C --> D{dataflow flag set?}
    D -->|No| E[Complete build]
    D -->|Yes| F[extractDataflow AST walk]
    F --> G[Track parameters & scope]
    F --> H[Track return statements]
    F --> I[Track call arg flows]
    F --> J[Track mutations]
    G --> K[buildDataflowEdges]
    H --> K
    I --> K
    J --> K
    K --> L[Resolve function names to node IDs]
    L --> M[Insert flows_to edges]
    L --> N[Insert returns edges]
    L --> O[Insert mutates edges]
    M --> P[(dataflow table)]
    N --> P
    O --> P
    P --> Q[dataflow command]
    Q --> R{Mode?}
    R -->|edges| S[dataflowData: show flows/returns/mutates]
    R -->|path| T[dataflowPathData: BFS A→B]
    R -->|impact| U[dataflowImpactData: return-dependent blast radius]
    S --> V[CLI/MCP/Batch output]
    T --> V
    U --> V

Last reviewed commit: 45eb976


@greptile-apps greptile-apps bot left a comment


10 files reviewed, no comments


@carlos-alm
Contributor Author

I have read the CLA Document and I hereby sign the CLA

Address review feedback from Greptile:
- Track spread arguments (foo(...args)) by unwrapping spread_element
- Handle optional chaining (foo?.bar()) in callee name resolution
- Track non-declaration assignments (x = foo() without const/let/var)
  as returns edges
- Add 3 tests covering these cases

Impact: 3 functions changed, 3 affected
@carlos-alm
Contributor Author

Addressed the known limitations identified in the Greptile review (8c82a14):

  1. Spread arguments (foo(...args)) — now unwraps spread_element nodes to track the underlying variable
  2. Optional chaining (foo?.bar()) — resolveCalleeName now handles optional_chain_expression to resolve callee names through ?.
  3. Non-declaration assignments (x = foo() without const/let/var) — assignment_expression handler now detects call RHS and creates returns edges + local bindings

Added 3 new parser unit tests covering each case. Full suite: 1105 tests pass.
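The spread-argument fix described above might look roughly like this. Real tree-sitter nodes are stood in for by plain objects here, and the function name is illustrative; only the spread_element/identifier node types follow the tree-sitter JavaScript grammar.

```javascript
// Sketch: unwrap spread_element so foo(...args) tracks `args` itself.
// Plain objects stand in for tree-sitter nodes in this illustration.
function resolveArgIdentifier(argNode) {
  if (argNode.type === "spread_element") {
    // Descend into the expression inside the spread.
    argNode = argNode.namedChildren[0];
  }
  // Only plain identifiers produce a trackable flow.
  return argNode.type === "identifier" ? argNode.text : null;
}
```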

@greptileai

@claude

claude bot commented Mar 3, 2026

Claude finished @carlos-alm's task.


PR Review: Dataflow Analysis Feature

Reviewing PR #254 for dataflow analysis functionality and the recent fixes addressing Greptile feedback.

Phases 1–4 were re-checked after the latest commit; findings are unchanged from the previous review ✅

Recent Fixes (8c82a14):

  • Spread arguments fixed: foo(...args) now unwraps spread_element nodes to track underlying variables
  • Optional chaining fixed: foo?.bar() properly handled in callee name resolution
  • Non-declaration assignments fixed: x = foo() now creates returns edges without const/let/var

Potential Issues:

  • None significant found. The implementation is solid.


Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: High-quality implementation with comprehensive testing, proper security practices, and clean architecture. The feature addresses a real need with appropriate MVP scope. Author has done thorough dogfooding verification. Only minor consideration is the large file size (1200+ lines) but it's well-organized and documented.

@carlos-alm
Contributor Author

recheck

@carlos-alm carlos-alm merged commit 0b2242c into main Mar 3, 2026
16 of 17 checks passed
@carlos-alm carlos-alm deleted the feat/dataflow-analysis branch March 3, 2026 00:17
@github-actions github-actions bot locked and limited conversation to collaborators Mar 3, 2026
