Phase 2: Testing framework and data quality checks #4
Merged
DataQuality transformation:
- DataQualityPlugin: row_count, missing_count, duplicate_count, string_length checks
- onFail behavior: abort (throw), warn (log), skip (silent)
- Passthrough transform: validates, then passes the DataFrame through unchanged
- Registered via ServiceLoader

Testing framework:
- InlineTestConfig: tests section in pipeline YAML with assert expressions
- TestRunner: evaluates assertions against pipeline execution results
- PipelineExecutor.executeAndCapture: runs pipeline and returns all DataFrames
- Support for sinks.X.row_count, missing_count(col), duplicate_count(col)

CLI:
- weaver test: execute pipeline and run inline tests
- weaver test --coverage: show which transforms/sinks have tests
- weaver test --auto-generate: placeholder for future schema-based generation

Config changes:
- PipelineConfig.tests: List[InlineTestConfig]
- TransformationConfig.checks + onFail fields
- YAMLParser updated with decoders for new fields

CI:
- GitHub Actions workflow: JDK 17, compile, test, format check, assembly
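The three onFail modes listed above can be illustrated with a small Spark-free sketch. This is not the actual DataQualityPlugin code: the names OnFail, CheckResult, DataQualityException and DataQualitySketch.handle are hypothetical, and returning warnings as strings stands in for whatever logger the project uses.

```scala
// Hypothetical sketch of the abort / warn / skip semantics; not the project's API.
sealed trait OnFail
object OnFail {
  case object Abort extends OnFail
  case object Warn extends OnFail
  case object Skip extends OnFail

  // Parses the onFail value as it might appear in the pipeline YAML.
  def parse(s: String): OnFail = s.trim.toLowerCase match {
    case "abort" => Abort
    case "warn"  => Warn
    case "skip"  => Skip
    case other   => throw new IllegalArgumentException(s"Unknown onFail value: $other")
  }
}

// Outcome of one check (for example row_count), already evaluated elsewhere.
final case class CheckResult(name: String, passed: Boolean, detail: String)

final class DataQualityException(message: String) extends RuntimeException(message)

object DataQualitySketch {
  // Returns the warnings that would be logged.
  // Throws DataQualityException when onFail is Abort and at least one check failed.
  def handle(results: Seq[CheckResult], onFail: OnFail): Seq[String] = {
    val failures = results.filterNot(_.passed)
    if (failures.isEmpty) Seq.empty
    else onFail match {
      case OnFail.Abort =>
        throw new DataQualityException(
          failures.map(f => s"${f.name}: ${f.detail}").mkString("; "))
      case OnFail.Warn =>
        failures.map(f => s"WARN data quality check failed: ${f.name} (${f.detail})")
      case OnFail.Skip =>
        Seq.empty // failures are ignored silently
    }
  }
}
```

Because the transform is a passthrough, a real implementation would presumably return the input DataFrame unchanged after this step in the warn and skip cases.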
- delta-spark removed as compile dependency (optional runtime via reflection)
- DeltaLakeSinkConnector uses reflection for merge to avoid ClassNotFoundException
- project/target/ files cleaned from git tracking
- .gitignore already covers **/target/
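For readers unfamiliar with the pattern, the sketch below shows one way an optional runtime dependency can be reached through reflection so that the calling class still loads when the library is absent. It is an assumption about the general approach, not the DeltaLakeSinkConnector code; OptionalDependency, isAvailable and invokeStatic are hypothetical names.

```scala
import scala.util.Try

// Hypothetical helper: nothing here references the optional library at compile time.
object OptionalDependency {
  // True when the named class can be loaded from the current classpath.
  def isAvailable(className: String): Boolean =
    Try(Class.forName(className)).isSuccess

  // Looks up a public static method by name and parameter types and invokes it.
  // A missing class or method surfaces as a Failure rather than a link-time error.
  def invokeStatic(
      className: String,
      method: String,
      argTypes: Seq[Class[_]],
      args: Seq[AnyRef]): Try[AnyRef] =
    Try {
      val cls = Class.forName(className)
      val m = cls.getMethod(method, argTypes: _*)
      m.invoke(null, args: _*)
    }
}
```

A connector could check `OptionalDependency.isAvailable("io.delta.tables.DeltaTable")` before attempting a merge and fail with a clear message when delta-spark is not on the runtime classpath.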
setup-java@v4 with cache: 'sbt' automatically installs and caches SBT. Removed manual cache step and scalafmtCheck (not configured yet).
setup-java cache:'sbt' only caches dirs, doesn't install SBT. Added explicit sbt/setup-sbt@v1 step + proper dependency cache.
PipelineExecutor now enriches the config map with the dataSource's id and query fields before passing to connectors. Fixes TestSourceConnector which needs 'id' to locate the test JSON file.
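A minimal sketch of the enrichment described above, assuming the connector config is a Map[String, String] and that query is optional. DataSourceConfig and ConfigEnrichment are hypothetical stand-ins for the project's own types, and letting the dataSource fields override existing keys is also an assumption.

```scala
// Hypothetical shape of a data source entry; the real config classes may differ.
final case class DataSourceConfig(
    id: String,
    query: Option[String],
    config: Map[String, String])

object ConfigEnrichment {
  // Adds the data source's id (always) and query (when present) to the
  // map handed to the connector. Later entries win on key collisions.
  def enrich(ds: DataSourceConfig): Map[String, String] =
    ds.config + ("id" -> ds.id) ++ ds.query.map(q => "query" -> q).toList
}
```

With this in place a connector such as TestSourceConnector can read "id" from the map it receives instead of needing the full data source object.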
Avoids classloader conflict from creating a second SparkSession via EngineSelector/ApplyCommand. Uses the existing test SparkSession with PipelineExecutor.execute() directly.
Summary
Adds a complete testing framework and data quality transformation for pipeline validation.
DataQuality transformation
- DataQualityPlugin: checks DSL for row_count, missing_count(col), duplicate_count(col), string_length(col)
- onFail behavior: abort (throw exception), warn (log and continue), skip (silent)

Testing framework
- InlineTestConfig (tests: section with assert expressions)
- TestRunner: evaluates assertions against pipeline results
- PipelineExecutor.executeAndCapture(): returns all intermediate DataFrames

CLI commands
- weaver test <pipeline>: execute pipeline and run inline tests
- weaver test --coverage: show which transforms/sinks have tests
- weaver test --auto-generate: placeholder for future auto-generation

CI/CD
Test plan
- sbt compile: all 4 modules compile
- sbt core/test: 47 non-Spark tests pass
- sbt test with Java 17+: all tests, including Spark
- weaver test <pipeline.yaml> runs inline tests
- weaver test --coverage shows coverage report
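As a rough illustration of what a coverage report like the one from weaver test --coverage could compute, the sketch below lists transform or sink ids that no inline assertion mentions. It is not the project's implementation: InlineTest, CoverageSketch and the example expression syntax (a comparison such as `sinks.output.row_count > 0`) are assumptions.

```scala
// Hypothetical inline test: a name plus the raw assert expression from the YAML.
final case class InlineTest(name: String, assertion: String)

object CoverageSketch {
  // Splits an expression into identifier-like tokens, so the id "out"
  // is not mistaken for a match inside "output".
  private def tokens(expr: String): Set[String] =
    expr.split("[^A-Za-z0-9_]+").filter(_.nonEmpty).toSet

  // Returns the ids (transforms or sinks) that no assertion refers to,
  // in their original order.
  def untested(targets: Seq[String], tests: Seq[InlineTest]): Seq[String] = {
    val mentioned = tests.flatMap(t => tokens(t.assertion)).toSet
    targets.filterNot(mentioned.contains)
  }
}
```

Token matching keeps the sketch independent of the real assertion grammar; a real report would more likely walk the parsed assertions.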