
Phase 2: Testing framework and data quality checks #4

Merged
netsirius merged 6 commits into main from feature/phase2-testing-quality
Apr 6, 2026

Conversation

@netsirius
Owner

Summary

Adds a complete testing framework and data quality transformation for pipeline validation.

DataQuality transformation

  • DataQualityPlugin: checks DSL for row_count, missing_count(col), duplicate_count(col), string_length(col)
  • onFail behavior: abort (throw exception), warn (log and continue), skip (silent)
  • Passthrough transform — validates data then passes DataFrame unchanged
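A DataQuality step in pipeline YAML might look like the following sketch. The field names `checks` and `onFail` follow the bullets above; the step name, the `dataQuality` type identifier, and the column names are hypothetical:

```yaml
transformations:
  - name: validate_users        # hypothetical step name
    type: dataQuality           # assumed type identifier for DataQualityPlugin
    checks:
      - row_count > 0
      - missing_count(email) == 0
      - duplicate_count(user_id) == 0
      - string_length(country) == 2
    onFail: warn                # abort | warn | skip
```
Because the transform is a passthrough, the DataFrame continues downstream unchanged whether or not checks pass (subject to the `onFail` behavior).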

Testing framework

  • Inline tests in pipeline YAML (tests: section with assert expressions)
  • TestRunner: evaluates assertions against pipeline results
  • PipelineExecutor.executeAndCapture(): returns all intermediate DataFrames
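An inline `tests:` section could then be sketched as below. The `sinks.X.row_count` and `missing_count(col)` expression forms come from the description; the sink name, test names, and expected values are hypothetical:

```yaml
tests:
  - name: output_is_complete          # hypothetical test name
    assert: sinks.output.row_count == 100
  - name: no_missing_ids
    assert: sinks.output.missing_count(id) == 0
```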

CLI commands

  • weaver test <pipeline> — execute pipeline and run inline tests
  • weaver test --coverage — show which transforms/sinks have tests
  • weaver test --auto-generate — placeholder for future auto-generation

CI/CD

  • GitHub Actions workflow: JDK 17, compile, test, format check, assembly
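Based on the workflow bullets and the commit notes further down (setup-java@v4 with `cache: sbt`, plus an explicit `sbt/setup-sbt@v1` step, with scalafmtCheck left out for now), the workflow might be sketched as:

```yaml
name: CI
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '17'
          cache: sbt              # caches dependency dirs only
      - uses: sbt/setup-sbt@v1    # setup-java's cache does not install sbt itself
      - run: sbt compile test assembly
```
The exact job layout and step order are assumptions; only the tool versions and the compile/test/assembly stages are stated in the PR.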

Test plan

  • sbt compile — All 4 modules compile
  • sbt core/test — 47 non-Spark tests pass
  • sbt test with Java 17+ — all tests including Spark
  • Manual: weaver test <pipeline.yaml> runs inline tests
  • Manual: weaver test --coverage shows coverage report

DataQuality transformation:
- DataQualityPlugin: row_count, missing_count, duplicate_count, string_length checks
- onFail behavior: abort (throw), warn (log), skip (silent)
- Passthrough transform — validates then passes DataFrame unchanged
- Registered via ServiceLoader
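ServiceLoader registration typically means a provider-configuration file on the classpath; the package and interface names below are hypothetical, not the project's actual ones:

```
# src/main/resources/META-INF/services/com.example.weaver.TransformPlugin
com.example.weaver.quality.DataQualityPlugin
```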

Testing framework:
- InlineTestConfig: tests section in pipeline YAML with assert expressions
- TestRunner: evaluates assertions against pipeline execution results
- PipelineExecutor.executeAndCapture: runs pipeline and returns all DataFrames
- Support for sinks.X.row_count, missing_count(col), duplicate_count(col)
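To illustrate the shape of assertion evaluation, here is a minimal self-contained sketch, assuming assertions of the form `<metric> == <number>` checked against a map of captured per-sink metrics. All names and the parsing approach are hypothetical, not the project's TestRunner:

```scala
// Hypothetical sketch: evaluate "metric == expected" against captured metrics.
object AssertionSketch {
  // Metrics as they might be captured by executeAndCapture (values made up).
  val metrics: Map[String, Long] = Map(
    "sinks.output.row_count"         -> 100L,
    "sinks.output.missing_count(id)" -> 0L
  )

  // Split on "==", trim both sides, and compare the metric to the
  // expected numeric value; expects a numeric right-hand side.
  def evaluate(assertion: String): Boolean =
    assertion.split("==").map(_.trim) match {
      case Array(lhs, rhs) => metrics.get(lhs).contains(rhs.toLong)
      case _               => false
    }
}
```
A real runner would also need richer expressions and failure reporting; this only shows the lookup-and-compare core.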

CLI:
- weaver test: execute pipeline and run inline tests
- weaver test --coverage: show which transforms/sinks have tests
- weaver test --auto-generate: placeholder for future schema-based generation

Config changes:
- PipelineConfig.tests: List[InlineTestConfig]
- TransformationConfig.checks + onFail fields
- YAMLParser updated with decoders for new fields
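The config changes above suggest case-class shapes roughly like the following. The exact field types and defaults are assumptions, not the project's actual code:

```scala
// Hypothetical sketch of the new config fields described above.
final case class InlineTestConfig(name: String, assert: String)

final case class TransformationConfig(
  `type`: String,
  checks: List[String] = Nil, // e.g. "missing_count(email) == 0"
  onFail: String = "abort"    // abort | warn | skip
)

final case class PipelineConfig(
  name: String,
  transformations: List[TransformationConfig] = Nil,
  tests: List[InlineTestConfig] = Nil
)
```
In this sketch the YAMLParser decoders would map the new `tests:`, `checks:`, and `onFail:` keys onto these fields.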

CI:
- GitHub Actions workflow: JDK 17, compile, test, format check, assembly
- delta-spark removed as compile dependency (optional runtime via reflection)
- DeltaLakeSinkConnector uses reflection for merge to avoid ClassNotFoundException
- project/target/ files cleaned from git tracking
- .gitignore already covers **/target/
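The optional-dependency pattern described for DeltaLakeSinkConnector can be sketched as a runtime classpath probe; the object and method names here are hypothetical:

```scala
import scala.util.Try

// Hypothetical sketch: probe for an optional dependency at runtime instead of
// linking it at compile time, so the build does not need delta-spark.
object OptionalDependency {
  def isAvailable(fqcn: String): Boolean =
    Try(Class.forName(fqcn)).isSuccess

  // A connector would branch on this before invoking Delta's merge API
  // reflectively, avoiding ClassNotFoundException when delta-spark is absent.
  def deltaAvailable: Boolean = isAvailable("io.delta.tables.DeltaTable")
}
```
The actual connector would go further and invoke the merge methods themselves through reflection; this shows only the availability check.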
Commit notes:
- setup-java@v4 with cache: 'sbt' automatically installs and caches SBT; removed the manual cache step and scalafmtCheck (not configured yet).
- Correction: setup-java's cache: 'sbt' only caches directories and does not install SBT itself; added an explicit sbt/setup-sbt@v1 step plus a proper dependency cache.
- PipelineExecutor now enriches the config map with the dataSource's id and query fields before passing it to connectors. This fixes TestSourceConnector, which needs 'id' to locate the test JSON file.
- Avoided a classloader conflict caused by creating a second SparkSession via EngineSelector/ApplyCommand; the existing test SparkSession is now used with PipelineExecutor.execute() directly.
@netsirius netsirius merged commit f67fc8c into main Apr 6, 2026
1 check passed
