
Phase 2: Testing framework and data quality checks #4

Merged
netsirius merged 6 commits into main from feature/phase2-testing-quality
Apr 6, 2026

Conversation

@netsirius
Owner

Summary

Adds a complete testing framework and data quality transformation for pipeline validation.

DataQuality transformation

  • DataQualityPlugin: checks DSL for row_count, missing_count(col), duplicate_count(col), string_length(col)
  • onFail behavior: abort (throw exception), warn (log and continue), skip (silent)
  • Passthrough transform — validates data then passes DataFrame unchanged
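A DataQuality step in pipeline YAML might look like the following sketch. The field names `checks` and `onFail` follow the bullets above; the step name, the `dataQuality` type identifier, and the column names are hypothetical:

```yaml
transformations:
  - name: validate_users        # hypothetical step name
    type: dataQuality           # assumed type identifier for DataQualityPlugin
    checks:
      - row_count > 0
      - missing_count(email) == 0
      - duplicate_count(user_id) == 0
      - string_length(country) == 2
    onFail: warn                # abort | warn | skip
```
Because the transform is a passthrough, the DataFrame continues downstream unchanged whether or not checks pass (subject to the `onFail` behavior).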

Testing framework

  • Inline tests in pipeline YAML (tests: section with assert expressions)
  • TestRunner: evaluates assertions against pipeline results
  • PipelineExecutor.executeAndCapture(): returns all intermediate DataFrames
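An inline `tests:` section could then be sketched as below. The `sinks.X.row_count` and `missing_count(col)` expression forms come from the description; the sink name, test names, and expected values are hypothetical:

```yaml
tests:
  - name: output_is_complete          # hypothetical test name
    assert: sinks.output.row_count == 100
  - name: no_missing_ids
    assert: sinks.output.missing_count(id) == 0
```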

CLI commands

  • weaver test <pipeline> — execute pipeline and run inline tests
  • weaver test --coverage — show which transforms/sinks have tests
  • weaver test --auto-generate — placeholder for future auto-generation

CI/CD

  • GitHub Actions workflow: JDK 17, compile, test, format check, assembly
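Based on the workflow bullets and the commit notes further down (setup-java@v4 with `cache: sbt`, plus an explicit `sbt/setup-sbt@v1` step, with scalafmtCheck left out for now), the workflow might be sketched as:

```yaml
name: CI
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '17'
          cache: sbt              # caches dependency dirs only
      - uses: sbt/setup-sbt@v1    # setup-java's cache does not install sbt itself
      - run: sbt compile test assembly
```
The exact job layout and step order are assumptions; only the tool versions and the compile/test/assembly stages are stated in the PR.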

Test plan

  • sbt compile — All 4 modules compile
  • sbt core/test — 47 non-Spark tests pass
  • sbt test with Java 17+ — all tests including Spark
  • Manual: weaver test <pipeline.yaml> runs inline tests
  • Manual: weaver test --coverage shows coverage report

DataQuality transformation:
- DataQualityPlugin: row_count, missing_count, duplicate_count, string_length checks
- onFail behavior: abort (throw), warn (log), skip (silent)
- Passthrough transform — validates then passes DataFrame unchanged
- Registered via ServiceLoader
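ServiceLoader registration typically means a provider-configuration file on the classpath; the package and interface names below are hypothetical, not the project's actual ones:

```
# src/main/resources/META-INF/services/com.example.weaver.TransformPlugin
com.example.weaver.quality.DataQualityPlugin
```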

Testing framework:
- InlineTestConfig: tests section in pipeline YAML with assert expressions
- TestRunner: evaluates assertions against pipeline execution results
- PipelineExecutor.executeAndCapture: runs pipeline and returns all DataFrames
- Support for sinks.X.row_count, missing_count(col), duplicate_count(col)
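To illustrate the shape of assertion evaluation, here is a minimal self-contained sketch, assuming assertions of the form `<metric> == <number>` checked against a map of captured per-sink metrics. All names and the parsing approach are hypothetical, not the project's TestRunner:

```scala
// Hypothetical sketch: evaluate "metric == expected" against captured metrics.
object AssertionSketch {
  // Metrics as they might be captured by executeAndCapture (values made up).
  val metrics: Map[String, Long] = Map(
    "sinks.output.row_count"         -> 100L,
    "sinks.output.missing_count(id)" -> 0L
  )

  // Split on "==", trim both sides, and compare the metric to the
  // expected numeric value; expects a numeric right-hand side.
  def evaluate(assertion: String): Boolean =
    assertion.split("==").map(_.trim) match {
      case Array(lhs, rhs) => metrics.get(lhs).contains(rhs.toLong)
      case _               => false
    }
}
```
A real runner would also need richer expressions and failure reporting; this only shows the lookup-and-compare core.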

CLI:
- weaver test: execute pipeline and run inline tests
- weaver test --coverage: show which transforms/sinks have tests
- weaver test --auto-generate: placeholder for future schema-based generation

Config changes:
- PipelineConfig.tests: List[InlineTestConfig]
- TransformationConfig.checks + onFail fields
- YAMLParser updated with decoders for new fields
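The config changes above suggest case-class shapes roughly like the following. The exact field types and defaults are assumptions, not the project's actual code:

```scala
// Hypothetical sketch of the new config fields described above.
final case class InlineTestConfig(name: String, assert: String)

final case class TransformationConfig(
  `type`: String,
  checks: List[String] = Nil, // e.g. "missing_count(email) == 0"
  onFail: String = "abort"    // abort | warn | skip
)

final case class PipelineConfig(
  name: String,
  transformations: List[TransformationConfig] = Nil,
  tests: List[InlineTestConfig] = Nil
)
```
In this sketch the YAMLParser decoders would map the new `tests:`, `checks:`, and `onFail:` keys onto these fields.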

CI:
- GitHub Actions workflow: JDK 17, compile, test, format check, assembly
- delta-spark removed as compile dependency (optional runtime via reflection)
- DeltaLakeSinkConnector uses reflection for merge to avoid ClassNotFoundException
- project/target/ files cleaned from git tracking
- .gitignore already covers **/target/
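The optional-dependency pattern described for DeltaLakeSinkConnector can be sketched as a runtime classpath probe; the object and method names here are hypothetical:

```scala
import scala.util.Try

// Hypothetical sketch: probe for an optional dependency at runtime instead of
// linking it at compile time, so the build does not need delta-spark.
object OptionalDependency {
  def isAvailable(fqcn: String): Boolean =
    Try(Class.forName(fqcn)).isSuccess

  // A connector would branch on this before invoking Delta's merge API
  // reflectively, avoiding ClassNotFoundException when delta-spark is absent.
  def deltaAvailable: Boolean = isAvailable("io.delta.tables.DeltaTable")
}
```
The actual connector would go further and invoke the merge methods themselves through reflection; this shows only the availability check.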
Commit notes:
- setup-java@v4 with cache: 'sbt' automatically installs and caches SBT; removed the manual cache step and scalafmtCheck (not configured yet).
- Correction: setup-java's cache: 'sbt' only caches directories and does not install SBT itself; added an explicit sbt/setup-sbt@v1 step plus a proper dependency cache.
- PipelineExecutor now enriches the config map with the dataSource's id and query fields before passing it to connectors. This fixes TestSourceConnector, which needs 'id' to locate the test JSON file.
- Avoided a classloader conflict caused by creating a second SparkSession via EngineSelector/ApplyCommand; the existing test SparkSession is now used with PipelineExecutor.execute() directly.
@netsirius netsirius merged commit f67fc8c into main Apr 6, 2026
1 check passed
