
Flatten AnalyzeResult to top-level nodes/edges (#19) #33

Merged
melonamin merged 17 commits into master from feat/issue-19-flatten-lineage-model on Apr 15, 2026

@melonamin

Closes #19.

Summary

  • Replace per-statement StatementLineage.nodes/edges + the parallel GlobalLineage with a single flat graph at AnalyzeResult.{nodes, edges} plus statements: StatementMeta[] for per-statement metadata only.
  • Fold canonicalName and statementIds onto Node; statementIds onto Edge (cross_statement edges carry [producer, consumer]).
  • Preserve self-join instance distinctness in the flat graph: tables with instance-specific IDs (hashed canonical+alias+scope) are not collapsed into one canonical node, so `users a JOIN users b` renders as two nodes. Cross-statement merging by canonical identity still applies.
  • Drop GlobalLineage, GlobalNode, GlobalEdge, StatementRef (Rust + TS). StatementLineage becomes a crate-private analyzer intermediate.
  • Add helpers AnalyzeResult::nodes_in_statement(idx) / edges_in_statement(idx) for projecting the flat graph back into a per-statement view.
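
The flat shape and the per-statement projection described in the list above can be sketched in TypeScript as follows. This is a minimal illustration, not the generated bindings: the `FlatNode` / `FlatEdge` / `FlatResult` names, the `from` / `to` / `type` fields, and the statement type strings are assumptions made for the example, and `nodesInStatement` / `edgesInStatement` are written here as free functions that simply filter by `statementIds`.

```typescript
// Hypothetical, trimmed-down shapes; the generated bindings carry more fields.
interface FlatNode {
  id: string;
  label: string;
  canonicalName?: string; // assumption: simplified to a plain string
  statementIds: number[]; // every statement this node participates in
}

interface FlatEdge {
  id: string;
  from: string; // assumption: endpoint field names are illustrative
  to: string;
  type: "data_flow" | "cross_statement"; // assumption: only two kinds shown
  statementIds: number[]; // cross_statement edges hold [producer, consumer]
}

interface StatementMeta {
  statementIndex: number;
  statementType: string;
}

interface FlatResult {
  nodes: FlatNode[];
  edges: FlatEdge[];
  statements: StatementMeta[]; // metadata only, no per-statement subgraph
}

// Project the flat graph back down to a single statement's view.
function nodesInStatement(result: FlatResult, idx: number): FlatNode[] {
  return result.nodes.filter((n) => n.statementIds.indexOf(idx) !== -1);
}

function edgesInStatement(result: FlatResult, idx: number): FlatEdge[] {
  return result.edges.filter((e) => e.statementIds.indexOf(idx) !== -1);
}

// Statement 0 builds `staging` from `users`; statement 1 reads `staging`.
const exampleResult: FlatResult = {
  nodes: [
    { id: "t_users", label: "users", canonicalName: "users", statementIds: [0] },
    { id: "t_staging", label: "staging", canonicalName: "staging", statementIds: [0, 1] },
    { id: "t_report", label: "report", canonicalName: "report", statementIds: [1] },
  ],
  edges: [
    { id: "e1", from: "t_users", to: "t_staging", type: "data_flow", statementIds: [0] },
    { id: "x1", from: "t_staging", to: "t_report", type: "cross_statement", statementIds: [0, 1] },
  ],
  statements: [
    { statementIndex: 0, statementType: "CREATE_TABLE_AS" },
    { statementIndex: 1, statementType: "INSERT" },
  ],
};

// IDs of the nodes visible from statement 1: ["t_staging", "t_report"]
const stmt1NodeIds = nodesInStatement(exampleResult, 1).map((n) => n.id);
```

In this sketch a node merged across statements (`t_staging`) shows up in both per-statement views, because membership is decided purely by `statementIds`.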

Migration scope

  • flowscope-core: model + analyzer/global.rs flatten step + cross_statement.rs rewritten; ~280 test assertions migrated.
  • flowscope-export: schema redesigned — drops global_* tables, adds node_statements / edge_statements / node_name_spans junctions; SQL + DuckDB + extract + mermaid + join_export backends migrated.
  • flowscope-cli, flowscope-wasm: integration tests + output/table.rs migrated.
  • packages/core: TS types regenerated; docs/api_schema.json regenerated via just update-schema.
  • packages/react: GraphView, MatrixView, workers, hooks, utils migrated. Internal hydrateStatements() rebuilds the per-statement view at the entry points to keep the graph builder diff small (follow-up: a separate PR will read the flat shape directly).
  • vscode, app/: hover/codeLens/lineage panel + AnalysisView/HierarchyView/schema-parser migrated.
  • docs/api-types.md + docs/core-engine-spec.md updated.
  • 36 insta snapshots regenerated.
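
One way to picture the `node_statements` / `edge_statements` junctions mentioned for flowscope-export is as one row per (item, statement) pair derived from `statementIds`. The sketch below is an assumption-laden illustration: the `JunctionRow` shape, its column names, and the `toJunctionRows` helper are hypothetical and not taken from the export crate.

```typescript
// Anything in the flat graph that lists the statements it belongs to.
interface StatementScoped {
  id: string;
  statementIds: number[];
}

// Hypothetical row shape for a junction table such as node_statements;
// the real column names are not given in this PR text.
interface JunctionRow {
  itemId: string;
  statementIndex: number;
}

// Emit one row per (item, statement) pair, skipping repeated statement ids.
function toJunctionRows(items: StatementScoped[]): JunctionRow[] {
  const rows: JunctionRow[] = [];
  for (const item of items) {
    const seen: number[] = [];
    for (const statementIndex of item.statementIds) {
      if (seen.indexOf(statementIndex) !== -1) continue;
      seen.push(statementIndex);
      rows.push({ itemId: item.id, statementIndex: statementIndex });
    }
  }
  return rows;
}

// Three rows: (t_staging, 0), (t_staging, 1), (t_report, 1)
const nodeStatementRows = toJunctionRows([
  { id: "t_staging", statementIds: [0, 1] },
  { id: "t_report", statementIds: [1] },
]);
```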

Test plan

  • cargo test --workspace — 2500+ tests pass, 0 fail
  • just check — fmt + clippy + typecheck + Rust tests + schema-compat all green
  • yarn workspaces run test --silent — 131 react + 41 core + 3 schema-compat all pass
  • just check-schema — Rust JSON Schema and TS shape agree
  • Manual smoke test of the demo app graph view (recommended before merge)

Replaces StatementLineage.nodes/edges + GlobalLineage with a single flat
graph at AnalyzeResult.{nodes,edges}, plus a StatementMeta vec for
per-statement metadata. Fold canonical_name + statement_ids onto Node;
edges carry statement_ids (cross-statement edges hold [producer, consumer]).

State at this commit:
- Rust production code compiles workspace-wide
  - flowscope-core: model + analyzer/global.rs + cross_statement migrated
  - flowscope-export: schema redesigned (drops global_*, adds
    node_statements / edge_statements / node_name_spans junctions),
    duckdb + sql backends + extract + mermaid + join_export migrated
  - flowscope-cli: output/table.rs migrated
- Tests still broken (~200 errors). Mechanical patterns:
  - result.global_lineage.{nodes,edges} -> result.{nodes,edges} (sed-able)
  - stmt.{nodes,edges} -> result.{nodes_in_statement,edges_in_statement}(stmt.statement_index)
  - node.statement_refs -> node.statement_ids
  - edge.{producer,consumer}_statement -> edge.statement_ids[0/1]
  - node.canonical_name.X -> node.canonical_name.as_ref().map_or(.., |c| &c.X)
- Insta snapshots will regenerate after tests pass
- TS bindings (packages/core/src/types.ts), packages/react, vscode/, app/,
  docs all still on the old shape

Helpers added: AnalyzeResult::nodes_in_statement / edges_in_statement
let consumers project the flat graph back down to a per-statement view.

Migrate packages/core types to the flat lineage model: statements carry
metadata only, with nodes/edges hoisted to the top level and each item
tracking the statementIds it participates in. Drop GlobalLineage,
GlobalNode, GlobalEdge, StatementRef, and StatementLineage; add
StatementMeta plus nodesInStatement/edgesInStatement helpers.

Update the standalone vscode types mirror, hover/codelens providers,
and the lineage panel to filter the flat nodes/edges by statementIds
instead of reading from per-statement subgraphs or globalLineage.

Redefine StatementLineage locally in packages/react as a per-statement
view hydrated from the flat result (statement metadata + the subset of
result.nodes/edges that list the statement in statementIds). Add a
hydrateStatements helper and use it at the GraphView / MatrixView entry
points so the graph-building utilities and workers can keep operating
on the legacy per-statement shape.

Drop the globalLineage/globalNode indirection used for canonical-name
lookup; canonicalName is now carried on the Node itself. Migrate
useGraphSearch, useSearchSuggestions, TableFilterDropdown, ColumnPanel,
and nodeOccurrences.findMergedNodeById to read the flat graph directly.

Update schema-parser, useIssueLocations, useDebugData, AnalysisView, and
HierarchyView to read from the top-level result.nodes/edges and the
per-node statementIds / canonicalName fields instead of the removed
globalLineage/statementRefs indirection.

The flatten step in analyzer/global.rs was collapsing self-join instances
of the same table back into a single node, contradicting the doc comment
that said "Self-join instances remain distinct". Two fixes:

1. global_node_id for Table/View now keeps the local node ID when it
   differs from relation_identity(canonical) — i.e. when the analyzer
   minted an instance-specific ID via relation_instance_identity
   (canonical+alias+scope hash). The canonical-only ID is still used for
   simple references that share canonical identity across statements.

2. statement_scoped_relation_ids in flatten_lineages now also includes
   self-join instance table nodes, so columns owned by those instances
   stay statement-scoped (otherwise e1.name and e2.name in
   `users e1 JOIN users e2` would reconnect through a shared global
   column node).
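
The ID rule in the two fixes above can be illustrated with a small TypeScript sketch (the analyzer itself is Rust). Everything here is a stand-in: `relationIdentity` and `relationInstanceIdentity` mimic the roles of the Rust functions of similar names but use plain string concatenation instead of a hash, and `LocalRelation`, `FlatRelation`, and `flattenRelations` are hypothetical names.

```typescript
// A table reference as seen inside one statement (hypothetical shape).
interface LocalRelation {
  localId: string;
  canonical: string;
}

interface FlatRelation {
  id: string;
  canonical: string;
  statementIds: number[];
}

// Stand-in for the canonical-only identity shared across statements.
function relationIdentity(canonical: string): string {
  return "rel:" + canonical.toLowerCase();
}

// Stand-in for the instance identity (canonical + alias + scope).
// The PR describes a hash; concatenation keeps this example readable.
function relationInstanceIdentity(canonical: string, alias: string, scope: number): string {
  return relationIdentity(canonical) + "#" + alias + "@" + scope;
}

// Keep the local ID when it is instance-specific; otherwise use the canonical ID.
function globalNodeId(rel: LocalRelation): string {
  const canonicalId = relationIdentity(rel.canonical);
  const isInstanceSpecific = rel.localId !== canonicalId;
  return isInstanceSpecific ? rel.localId : canonicalId;
}

// Merge per-statement relations into flat nodes keyed by globalNodeId.
function flattenRelations(perStatement: LocalRelation[][]): FlatRelation[] {
  const byId: { [id: string]: FlatRelation } = {};
  const out: FlatRelation[] = [];
  perStatement.forEach((relations, statementIndex) => {
    for (const rel of relations) {
      const id = globalNodeId(rel);
      let flat = byId[id];
      if (flat === undefined) {
        flat = { id: id, canonical: rel.canonical, statementIds: [] };
        byId[id] = flat;
        out.push(flat);
      }
      if (flat.statementIds.indexOf(statementIndex) === -1) {
        flat.statementIds.push(statementIndex);
      }
    }
  });
  return out;
}

// Statement 0: users e1 JOIN users e2 JOIN orders; statement 1: orders only.
const flattened = flattenRelations([
  [
    { localId: relationInstanceIdentity("users", "e1", 0), canonical: "users" },
    { localId: relationInstanceIdentity("users", "e2", 0), canonical: "users" },
    { localId: relationIdentity("orders"), canonical: "orders" },
  ],
  [{ localId: relationIdentity("orders"), canonical: "orders" }],
]);
```

Under these assumptions the two `users` instances stay as separate nodes, while `orders` merges into one node whose `statementIds` covers both statements.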

Updates the four self_join_global_lineage_* tests that had previously
asserted the old "merge by canonical" semantics; the tests were
renamed and their assertions inverted to expect distinct per-instance
nodes, which is the new (correct) behavior.

CTE self-joins remain collapsed at the analyzer level (the analyzer
does not synthesize per-instance CTE column nodes the way it does for
base tables) — documented in
global_lineage_merges_qualified_columns_across_self_joins_and_cte_instances.

State after this commit:
- cargo test --workspace: 280/281 lineage_engine pass; 36 insta
  snapshots need regeneration via `cargo insta accept` (10 in
  tests/golden.rs, 26 in tests/snapshots.rs). All other tests green.
- TS migration committed in agent commits (a7b9960, b819416, fab863d,
  42e90ce). just typecheck + check-schema clean.

- docs/api-types.md: replace StatementLineage/GlobalLineage section with
  flat AnalyzeResult shape, document canonicalName + statementIds on
  Node and Edge, document cross_statement [producer, consumer] ordering.
- docs/core-engine-spec.md: rewrite Lineage Graph Output to describe the
  single flat graph and the self-join instance preservation rule.
- analyzer/global.rs: clippy::manual_contains — replace iter().any() on
  Vec<Span> with .contains().

Also picks up Prettier formatting on the regenerated TS bindings and
api_schema.json from `just fmt-ts`.

Delete the per-statement `StatementLineage` compatibility layer and the
`hydrateStatements` helper introduced alongside the flat `AnalyzeResult`
migration. All consumers now read `result.nodes` / `result.edges` /
`result.statements` directly, using `nodesInStatement` /
`edgesInStatement` from `@pondpilot/flowscope-core` for per-statement
views.

- Introduce `MergedLineage` + `mergeAnalyzeResult` in graphBuilders.ts
  and the graph builder worker to replace the legacy merged statement
  shape; the builders now accept a merged view or the full AnalyzeResult
  instead of a per-statement lineage.
- matrixUtils / matrix worker / matrix worker service take AnalyzeResult.
- graphBuilder worker + service take AnalyzeResult; lineageHelpers
  `getCreatedRelationNodeIds` takes (statementType, nodes, edges).
- Delete unused `mergeStatementNodesForNavigation`,
  `StatementLineageWithSource`, `mergeStatements` and the legacy
  `normalizeStatement` / `withSourceName` helpers in the worker.
- Update MatrixView + GraphView to pass the flat result straight to the
  worker services.
- Update `graphBuilders.test.ts` and `matrixView.test.ts` to build
  AnalyzeResult fixtures via a local `toResult` helper that preserves
  per-statement node instances (matching `buildMultiStatementResult` in
  occurrenceCycling tests).

- Split Analyzer::flatten_lineages into focused helpers
  (collect_statement_scoped_ids, merge_lineage_nodes, merge_lineage_edges,
  append_cross_statement_edges, finalize_nodes, finalize_edges).
- Expose STATEMENT_FILTERS_METADATA_KEY and add Node::filters_for_statement
  so per-statement filter lookups are centralized on the type.
- Harden per-statement filter serialization with .expect to surface
  serialization bugs instead of silently dropping filters.
- Standardize statement-scoped iteration on nodesInStatement /
  edgesInStatement in VSCode providers and ColumnPanel; add the helpers
  to the VSCode types module.
- Document SQL export identifier-quoting caveats on the export crate.
- Add regression test ensuring the SQL joins export dedups column-level
  edges per statement while preserving one row per statement.

Restore per-statement aggregation and filter semantics after flattening AnalyzeResult. Export backends now write filters and aggregations per node+statement, and UI statement scoping reads the preserved aggregation metadata.

Also exclude virtual output nodes from React search suggestions and add regressions for the flattened graph/export behavior.

- Drop redundant statement_ids reassignment in merge_lineage_edges; the
  struct literal already overrides the spread value.
- Invalidate CodeLens and Hover caches on flowscope config changes so
  dialect switches don't serve stale analysis.
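
The cache invalidation in the last bullet can be pictured with the sketch below. It is not the extension's code: `AnalysisCache`, `invalidateOnConfigChange`, and the `"flowscope"` section name are assumptions. The `ConfigChange` interface mirrors the `affectsConfiguration` method of VS Code's `ConfigurationChangeEvent`, which an extension would typically receive from `vscode.workspace.onDidChangeConfiguration`.

```typescript
// Hypothetical cache keyed by document URI; values stand in for analysis results.
class AnalysisCache<T> {
  private entries: { [uri: string]: T } = {};

  getOrCompute(uri: string, compute: () => T): T {
    const cached = this.entries[uri];
    if (cached !== undefined) return cached;
    const fresh = compute();
    this.entries[uri] = fresh;
    return fresh;
  }

  clear(): void {
    this.entries = {};
  }
}

// Same method shape as VS Code's ConfigurationChangeEvent.
interface ConfigChange {
  affectsConfiguration(section: string): boolean;
}

// Clear every cache when the (assumed) "flowscope" section changes.
function invalidateOnConfigChange(change: ConfigChange, caches: { clear(): void }[]): boolean {
  if (!change.affectsConfiguration("flowscope")) return false;
  for (const cache of caches) cache.clear();
  return true;
}

// Simulate a dialect switch between two hover requests.
let dialect = "postgres";
const hoverCache = new AnalysisCache<string>();
const analyze = () => "analyzed as " + dialect;

hoverCache.getOrCompute("file:///q.sql", analyze);
dialect = "snowflake";
// Without invalidation the old result is still served.
const stale = hoverCache.getOrCompute("file:///q.sql", analyze);

invalidateOnConfigChange({ affectsConfiguration: (s) => s === "flowscope" }, [hoverCache]);
// After invalidation the analysis is recomputed with the new dialect.
const fresh = hoverCache.getOrCompute("file:///q.sql", analyze);
```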
melonamin merged commit 5a27f34 into master on Apr 15, 2026

Linked issue: Model: flatten GlobalLineage into top-level nodes/edges with statementIds
