Skip to content

feat: Architectural improvements and module decomposition#53

Merged
mingjerli merged 13 commits intomainfrom
fix/mechanical-review-fixes
Apr 12, 2026
Merged

feat: Architectural improvements and module decomposition#53
mingjerli merged 13 commits intomainfrom
fix/mechanical-review-fixes

Conversation

@mingjerli
Copy link
Copy Markdown
Owner

@mingjerli mingjerli commented Feb 7, 2026

Summary

This PR combines architectural security improvements (Items 7, 10) with a comprehensive module decomposition of the two largest files in the codebase.

Security & Validation (original PR #53 scope)

  • Path validation (path_validation.py) — TOCTOU-safe file reading, symlink protection, Windows reserved name detection, Unicode normalization
  • Prompt injection mitigation (prompt_sanitization.py) — 4-layer defense with input sanitization, Unicode NFKC normalization, output validation, sqlglot-based SQL validation

Module Decomposition (new)

Decomposes pipeline.py (2,795 → 1,247 lines) and lineage_builder.py (3,419 → 1,366 lines) into focused, independently testable modules:

New Module Lines Responsibility Extracted From
pipeline_lineage_builder.py 744 Cross-query lineage building pipeline.py
pipeline_factory.py 379 All Pipeline.from_* factory methods pipeline.py
lineage_tracer.py 446 BFS backward/forward column tracing pipeline.py
node_factory.py 404 Node/edge creation helpers lineage_builder.py
trace_strategies.py 388 Column dependency tracing strategies lineage_builder.py
column_extractor.py 354 UNION/PIVOT/UNPIVOT/MERGE extraction lineage_builder.py
aggregate_parser.py 360 Aggregate spec parsing (pure functions) lineage_builder.py
lineage_utils.py 592 JSON/schema/nested-access utilities lineage_builder.py
sql_column_tracer.py 296 SQLColumnTracer wrapper class lineage_builder.py
tvf_registry.py 72 TVF constants and registry query_parser.py
metadata_manager.py 185 Metadata propagation/generation pipeline.py
pipeline_validator.py 169 Validation issue management pipeline.py
subpipeline_builder.py 183 Pipeline splitting/subpipeline pipeline.py

Approach

  • Extract, don't rewrite — code moved verbatim, no behavioral changes
  • Full backward compatibility — all public imports preserved via re-exports
  • TDD — each extraction: write failing test → extract → verify pass → commit
  • Zero regressions — all 1,282 tests pass

File Size Reductions

File Before After Reduction
pipeline.py 2,795 1,247 -55%
lineage_builder.py 3,419 1,366 -60%
query_parser.py 2,354 2,301 -2%

Test plan

  • All 1,282 tests pass (778 original + 504 new)
  • ruff check and ruff format clean
  • CI green on Python 3.10, 3.11, 3.12, 3.13
  • Backward compat verified: from clgraph import Pipeline, SQLColumnTracer, PipelineLineageBuilder
  • No circular imports
  • 71 path validation tests (100% coverage)
  • 100 prompt sanitization tests (95% coverage)
  • 80%+ coverage on all new decomposition modules

## Item 7: Path Validation
- Add path_validation.py with PathValidator class
- TOCTOU-safe file reading via _safe_read_sql_file()
- Symlink protection with opt-in allow_symlinks parameter
- Windows reserved name detection
- Unicode normalization for homoglyph attack prevention
- 100% test coverage (71 tests)

## Item 10: Prompt Injection Mitigation
- Add prompt_sanitization.py with 4-layer defense
- Input sanitization with tag escaping (not removal)
- Unicode NFKC normalization for Cyrillic bypass prevention
- Output validation with semantic relevance checking
- sqlglot-based SQL validation for destructive operations
- Environment variable CLGRAPH_DISABLE_PROMPT_SANITIZATION for debugging
- 95% test coverage (100 tests)

## Item 9: File Splitting
- Extract lineage_utils.py from lineage_builder.py (~592 lines)
- Extract sql_column_tracer.py from lineage_builder.py (~296 lines)
- Extract tvf_registry.py from query_parser.py (~72 lines)
- Maintain backward compatibility via re-exports

## Item 8: Pipeline Decomposition
- Extract LineageTracer component (~400 lines)
- Extract MetadataManager component (~185 lines)
- Extract PipelineValidator component (~169 lines)
- Extract SubpipelineBuilder component (~183 lines)
- Pipeline now uses facade pattern with lazy initialization
- All 1,052 existing tests pass without modification

File size reductions:
- pipeline.py: 2,795 → 2,426 lines
- lineage_builder.py: 3,419 → 2,666 lines
- query_parser.py: 2,354 → 2,313 lines
Move factory method bodies (create_from_tuples, create_from_dict,
create_from_sql_list, create_from_sql_string, create_from_json,
create_from_json_file, create_from_sql_files, create_empty, and the
generate_query_id helper) to a new pipeline_factory module.

Pipeline classmethods are retained as thin delegators to preserve the
public API. Internal cross-calls within pipeline_factory use local
functions directly to avoid circular imports.
Move 5 pure methods from RecursiveLineageBuilder to a standalone module:
parse_aggregate_spec, get_aggregate_func_name, infer_aggregate_return_type,
has_star_in_aggregate, unit_has_fully_resolved_columns.

Add 63 unit tests with 81% coverage for the new module.
Extract the five branching strategies from the 356-line
_trace_column_dependencies method into focused strategy functions in
src/clgraph/trace_strategies.py:

- trace_star_passthrough: SELECT * / star column handling
- trace_aggregate_star: COUNT(*) and aggregates with *
- trace_set_operation: UNION/INTERSECT/EXCEPT column edges
- trace_merge_columns: MERGE statement column tracking
- trace_regular_columns: Normal column refs with UNNEST/TVF/VALUES sub-cases

The builder method is now a compact ~45-line dispatcher. Resolution
methods (_resolve_source_unit, _resolve_base_table_name,
_find_column_in_unit, _get_default_from_table) stay on the builder and
are passed as callables to strategy functions to avoid coupling.

Adds tests/test_trace_strategies.py with 17 behavior-preserving tests
exercising all five branches via the public RecursiveLineageBuilder API.

1027 passed, 40 skipped.
@mingjerli mingjerli changed the title feat: Implement architectural improvements (Items 7-10) feat: Architectural improvements and module decomposition Apr 12, 2026
@mingjerli mingjerli merged commit ab72d4c into main Apr 12, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant