hydra-spark provides a declarative, intuitive interface for creating and submitting [Apache Spark](http://spark-project.org) data flow pipelines, leveraging the flexibility of Spark's DataFrame API.
This repo contains the complete Hydra Spark project, including unit tests and deploy scripts.
- Declarative Spark jobs: a simple JSON/HOCON-based syntax for describing Spark jobs
- Support for Hadoop, Hive, Kafka (as both a source and a sink), Elasticsearch, and many others. See Sources.
- Support for both batch and streaming jobs through a unified API.
- Support for different Spark deploy modes (local, yarn-client), which can also be overridden at the DSL level.
- Support for Scala 2.10 and 2.11
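To give a feel for the declarative syntax, here is a minimal sketch of a HOCON job description. The block and field names below are illustrative assumptions, not the verified hydra-spark schema; consult the Sources documentation for the actual elements supported.

```hocon
# Hypothetical job description -- names are illustrative, not the official schema.
transport {
  name = "kafka-to-hive-example"

  # Where records come from (Kafka topic assumed as the source)
  source {
    kafka {
      topics = ["user-events"]
    }
  }

  # DataFrame operations applied in order
  operations {
    select-columns {
      columns = ["userId", "eventType", "timestamp"]
    }
  }
}
```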
| Version | Spark Version |
|---------|---------------|
| master  | 2.2           |
For release notes, look in the notes/ directory. They should also be up on notes.implicit.ly.
We host non-release jars at Jitpack.
The easiest way to get started is to try the Docker container which prepackages a Spark distribution with the Hydra Spark DSL assembly included.
Other ways to run:
- Build and run directly from an IDE. IntelliJ instructions follow below.
- Run using sbt
- Run sbt assembly and copy the jar to the Spark cluster.
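For the last option, the build-and-copy step might look like the following. The assembly jar path, main class, and job file name are assumptions for illustration; check the `target/` directory and the project's entry point for the exact values.

```shell
# Build a fat jar with all dependencies
sbt assembly

# Submit it to a Spark cluster (class name, jar path, and job.conf are
# hypothetical -- adjust to your build output and DSL file)
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class hydra.spark.Main \
  target/scala-2.11/hydra-spark-assembly.jar \
  job.conf
```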
The steps below show how to use hydra-spark with an example DSL by running Spark in-process. This setup is for experimentation only and is not representative of production usage.
You need to have SBT installed.
If you are using a Scala IDE (such as IntelliJ), you can import the project and start by running any of the test specs. To run a specific DSL <>
Documentation is coming soon.
Contributions via GitHub pull requests are welcome. See the TODO for some ideas.
Profiling software provided by YourKit. YourKit supports open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of YourKit Java Profiler and YourKit .NET Profiler, innovative and intelligent tools for profiling Java and .NET applications.
Please report bugs/problems to: https://github.com/pluralsight/hydra-spark/issues
Apache 2.0, see LICENSE.md