Spark Lite

What is it?

This project offers a type class that defines common operations on Spark's Dataset[T]. Implementations are provided for Dataset, RDD, and Vector, so you can switch seamlessly from one implementation to another without changing your functions.
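A minimal sketch of the idea, assuming a trait and method names chosen here for illustration (the actual names in spark-lite may differ). The type class abstracts over the container, and a Vector-backed instance runs on plain Scala collections with no Spark context:

```scala
import scala.language.higherKinds

// Hypothetical type class capturing common Dataset-like operations.
trait DatasetLike[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
  def filter[A](fa: F[A])(p: A => Boolean): F[A]
  def count[A](fa: F[A]): Long
}

// Vector instance: delegates to standard collection methods.
object VectorInstance extends DatasetLike[Vector] {
  def map[A, B](fa: Vector[A])(f: A => B): Vector[B] = fa.map(f)
  def filter[A](fa: Vector[A])(p: A => Boolean): Vector[A] = fa.filter(p)
  def count[A](fa: Vector[A]): Long = fa.size.toLong
}
```

Analogous instances for Dataset and RDD would delegate to the corresponding Spark methods, so code written against DatasetLike compiles unchanged for all three.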

Usage

  • Unit testing: starting a Spark context and running jobs is slow when the number of rows is small. Using the Vector implementation speeds up your unit tests by an order of magnitude and makes them easier to debug.

  • Optimization: sometimes it is not worth launching a Spark job when you know the number of rows to process is small enough to fit on one node. You can add a heuristic to your program that runs the computation on a single node when the row count is below a certain threshold.
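The unit-testing case above can be sketched as follows; the trait and instance names are assumptions for illustration, not the library's actual API. The function under test is written once against the abstraction and never mentions Spark, so a test can exercise it on a Vector instantly:

```scala
import scala.language.higherKinds

// Illustrative abstraction (names assumed, not spark-lite's real API).
trait DatasetLike[F[_]] {
  def filter[A](fa: F[A])(p: A => Boolean): F[A]
  def count[A](fa: F[A]): Long
}

object VectorInstance extends DatasetLike[Vector] {
  def filter[A](fa: Vector[A])(p: A => Boolean): Vector[A] = fa.filter(p)
  def count[A](fa: Vector[A]): Long = fa.size.toLong
}

// Business logic under test: generic in the container type F.
def countAdults[F[_]](ages: F[Int], D: DatasetLike[F]): Long =
  D.count(D.filter(ages)(_ >= 18))

// In a unit test this runs on a plain Vector, no SparkContext required:
val adults = countAdults(Vector(12, 25, 40, 17), VectorInstance)
// adults == 2L
```

In production the same countAdults would be called with a Dataset-backed instance, leaving the logic untouched.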

Dependencies

The project depends on Spark 2.0.2 and Scala 2.11.

How to install

There is no release yet; check out the repository and copy the sources into your project.

Examples
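A hedged end-to-end sketch of the pattern (all names here are illustrative assumptions; consult the sources for the library's actual trait and instances). A transformation is defined once against the type class and applied to a Vector:

```scala
import scala.language.higherKinds

// Assumed type class with a single operation, for brevity.
trait DatasetLike[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}

object VectorInstance extends DatasetLike[Vector] {
  def map[A, B](fa: Vector[A])(f: A => B): Vector[B] = fa.map(f)
}

// Generic transformation: trims and lower-cases names in any container F.
def normalize[F[_]](names: F[String], D: DatasetLike[F]): F[String] =
  D.map(names)(_.trim.toLowerCase)

val cleaned = normalize(Vector("  Alice", "BOB "), VectorInstance)
// cleaned == Vector("alice", "bob")
```

Swapping VectorInstance for a Dataset- or RDD-backed instance would run the identical normalize on a cluster.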

About

Use standard scala collections to unit test your Spark code.
