A library you can include in your Spark job to validate the counters and perform operations on success. The goal is Scala/Java/Python support.

holdenk/spark-validator

Spark Validator

A library you can include in your Spark job to validate the counters and perform operations on success.

This software should be considered pre-alpha.

Why you should validate counters

Maybe you are really lucky and you never have intermittent outages or bugs in your code.

If you have accumulators for things like records processed or number of errors, it's really easy to write bounds for these. Even if you don't have custom counters, you can use Spark's built-in metrics (bytes read, time, etc.) and establish reasonable bounds by looking at historic values. This can help catch jobs which fail to process some of your records. This is not a replacement for unit or integration testing.
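
To make the idea of bounds concrete, here is a minimal sketch in plain Scala. It is not spark-validator's API: `BoundsSketch`, `withinAbsolute`, `withinHistoric`, and the `tolerance` parameter are hypothetical names chosen for illustration. It assumes a counter is a `Long` and that history is simply the list of values from earlier runs.

```scala
// Hypothetical helper, NOT part of spark-validator. Plain Scala, no Spark needed.
object BoundsSketch {
  // Absolute check: the value must fall inside the optional min/max bounds.
  // A bound of None means "unbounded on that side".
  def withinAbsolute(value: Long, min: Option[Long], max: Option[Long]): Boolean =
    min.forall(value >= _) && max.forall(value <= _)

  // Historic check: the value must be within `tolerance` (a fraction, e.g. 0.2)
  // of the mean of previous runs. With no history there is nothing to compare
  // against, so the check passes.
  def withinHistoric(value: Long, history: Seq[Long], tolerance: Double): Boolean =
    if (history.isEmpty) true
    else {
      val mean = history.sum.toDouble / history.size
      math.abs(value - mean) <= tolerance * mean
    }
}
```

For example, `BoundsSketch.withinAbsolute(500L, Some(1000L), None)` rejects a run that read only 500 records, and `BoundsSketch.withinHistoric(950L, Seq(1000L, 1100L, 900L), 0.2)` accepts a run that is close to the previous average.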

How Spark validation works

We store all of the metrics from each run along with all of the accumulators you pass in.

If a run is successful, Spark Validator runs your on-success handler. If you just want to mark the run as a success, you can specify a file for Spark Validator to touch.
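
The "touch a file on success" behaviour can be sketched as follows. This only illustrates the pattern and is not how spark-validator implements it: `SuccessMarkerSketch` and `markIfValid` are hypothetical names, and the sketch assumes a local filesystem path, whereas a real Spark job would more likely write to HDFS or similar.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, NOT part of spark-validator.
object SuccessMarkerSketch {
  // Creates an empty marker file only when validation passed and the file does
  // not exist yet. Unlike the Unix `touch` command, it does not update the
  // timestamp of an existing file. Returns the validation result unchanged.
  def markIfValid(passed: Boolean, marker: Path): Boolean = {
    if (passed && !Files.exists(marker)) Files.createFile(marker)
    passed
  }
}
```

Downstream jobs can then check for the marker file before consuming the output, so a run that failed validation is never picked up by accident.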

How to write your validation rules

Absolute

Relative

How to build

sbt - Remember when it was called the simple build tool?

sbt/sbt compile

How to use

Scala

At the start of your Spark program, once you have constructed your Spark context, call:

import com.holdenkarau.spark_validator._
...
// Rules to check; here, at least 1000 records must have been read.
val rules = List(
    new AbsoluteValueRule(counter = "recordsRead", min=Some(1000), max=None),
    ...)
val vc = new ValidationConf(counterPath, jobName, firstTime, rules)
val vl = new Validation(vc)
...
// Check the collected counters against the rules.
vl.validate()

Java

vNext

Python

vNext+1

License
