Sparkling Water provides H2O functionality inside a Spark cluster

Sparkling Water

Join the chat at https://gitter.im/h2oai/sparkling-water

Sparkling Water integrates H2O's fast, scalable machine learning engine with Spark. It provides:

  • Utilities to publish Spark data structures (RDDs, DataFrames, Datasets) as H2O frames and vice versa.
  • A DSL for using Spark data structures as input for H2O's algorithms.
  • Basic building blocks for creating ML applications that use the Spark and H2O APIs.
  • A Python interface that enables use of Sparkling Water directly from PySpark.

Getting Started

User Documentation

Select the right version

Sparkling Water is developed in multiple parallel branches. Each branch corresponds to a Spark major release (e.g., branch rel-2.3 provides the implementation of Sparkling Water for Spark 2.3).

Please switch to the right branch:

  • For Spark 2.3 use branch rel-2.3

  • For Spark 2.2 use branch rel-2.2

  • For Spark 2.1 use branch rel-2.1

    Note: The master branch includes the latest changes for the latest Spark version. These changes are back-ported to older Sparkling Water versions.
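The branch list above can be sketched as a small shell helper. This is only an illustration: the function name branch_for_spark is hypothetical and not part of Sparkling Water, and it covers only the three branches listed above.

```shell
# Hypothetical helper (not shipped with Sparkling Water): map a Spark
# version string to the matching branch name from the list above.
branch_for_spark() {
  case "$1" in
    2.3|2.3.*) echo "rel-2.3" ;;
    2.2|2.2.*) echo "rel-2.2" ;;
    2.1|2.1.*) echo "rel-2.1" ;;
    *)
      # Versions outside the list above are reported as an error.
      echo "no matching branch listed for Spark $1" >&2
      return 1
      ;;
  esac
}

# Usage, inside a clone of the repository:
#   git checkout "$(branch_for_spark 2.3.1)"
```

The helper fails with a non-zero status for any Spark version that is not in the list, so a script using it stops instead of checking out an unrelated branch.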

Requirements

  • Linux/OS X/Windows
  • Java 8+
  • Python 2.7+ for the Python version of Sparkling Water (PySparkling)
  • Spark 2.3, with the SPARK_HOME shell variable pointing to your local Spark installation

Download Binaries

For each Sparkling Water release, you can download binaries here:

Maven

Each Sparkling Water release is published to Maven Central. Published artifacts are provided for the following Scala versions:

  • Sparkling Water 2.3.x - Scala 2.11
  • Sparkling Water 2.2.x - Scala 2.11
  • Sparkling Water 2.1.x - Scala 2.11

The artifact coordinates are:

  • ai.h2o:sparkling-water-core_{{scala_version}}:{{version}} - Includes core of Sparkling Water

  • ai.h2o:sparkling-water-examples_{{scala_version}}:{{version}} - Includes example applications

  • ai.h2o:sparkling-water-repl_{{scala_version}}:{{version}} - Spark REPL integration into H2O Flow UI

  • ai.h2o:sparkling-water-ml_{{scala_version}}:{{version}} - Extends Spark ML package by H2O-based transformations

  • ai.h2o:sparkling-water-package_{{scala_version}}:{{version}} - Uber Sparkling Water package referencing all available Sparkling Water modules. It is designed to be used as a Spark package via the --packages option

    Note: {{version}} refers to a release version of Sparkling Water, and {{scala_version}} refers to the Scala base version (for Sparkling Water 2.3, only 2.11). For example: ai.h2o:sparkling-water-examples_2.11:2.3.1
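As a rough sketch of how the placeholders are filled in, the snippet below assembles a coordinate from a module name. The helper sw_coordinate and the two variable names are hypothetical; the values 2.11 and 2.3.1 are taken from the example in the note above and should be replaced with the release you actually use.

```shell
# Values from the example in the note above; adjust for your release.
SCALA_VERSION="2.11"
SW_VERSION="2.3.1"

# Hypothetical helper: build the Maven coordinate for a Sparkling Water
# module name such as core, examples, repl, ml, or package.
sw_coordinate() {
  echo "ai.h2o:sparkling-water-${1}_${SCALA_VERSION}:${SW_VERSION}"
}

# Usage, assuming SPARK_HOME points to a Spark installation:
#   "$SPARK_HOME/bin/spark-shell" --packages "$(sw_coordinate package)"
```

Keeping the Scala and release versions in one place avoids mixing artifacts from different releases when more than one module is referenced.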

The full list of published packages is available here.


Use Sparkling Water

Sparkling Water is distributed as a Spark application library that can be used by any Spark application. We also provide a zip distribution that bundles the library and shell scripts.

There are several ways of using Sparkling Water:

  • Sparkling Shell
  • Sparkling Water driver
  • Spark Shell, including the Sparkling Water library via the --jars or --packages option
  • Spark Submit, including the Sparkling Water library via the --jars or --packages option
  • PySpark with PySparkling

Run Sparkling shell

The Sparkling Shell encapsulates a regular Spark shell and appends the Sparkling Water library to the classpath via the --jars option. The Sparkling Shell supports creation of an H2O cloud and execution of H2O algorithms.

  1. Either download or build Sparkling Water

  2. Configure the location of the Spark cluster:

    export SPARK_HOME="/path/to/spark/installation"
    export MASTER="local[*]"

    In this case, local[*] points to an embedded single-node cluster.

  3. Run Sparkling Shell:

    bin/sparkling-shell

    Sparkling Shell accepts common Spark Shell arguments. For example, to increase the memory allocated to each executor, use the spark.executor.memory parameter:

    bin/sparkling-shell --conf "spark.executor.memory=4g"

  4. Initialize H2OContext

    import org.apache.spark.h2o._
    val hc = H2OContext.getOrCreate(spark)

    H2OContext starts H2O services on top of the Spark cluster and provides primitives for transformations between H2O and Spark data structures.
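Steps 2 and 3 above depend on SPARK_HOME pointing to a usable Spark installation. A minimal pre-flight check might look like the sketch below. The function check_spark_home is hypothetical and not part of the distribution; it only assumes that a Spark installation ships an executable bin/spark-shell.

```shell
# Hypothetical pre-flight check: succeed only if SPARK_HOME is set and
# contains an executable bin/spark-shell.
check_spark_home() {
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$SPARK_HOME/bin/spark-shell" ]; then
    echo "no executable spark-shell under $SPARK_HOME/bin" >&2
    return 1
  fi
}

# Usage, from the Sparkling Water directory:
#   export MASTER="local[*]"
#   check_spark_home && bin/sparkling-shell --conf "spark.executor.memory=4g"
```

Running the check first turns a misconfigured environment into a short error message rather than a failure inside the shell launcher.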

Use Sparkling Water with PySpark

Sparkling Water can also be used directly from PySpark; this integration is called PySparkling.

See the PySparkling README to learn about PySparkling.

Use Sparkling Water via Spark Packages

To see how Sparkling Water can be used as a Spark package, please see Use as Spark Package.

Use Sparkling Water in Windows environments

See the Windows Tutorial to learn how to use Sparkling Water in Windows environments.

Sparkling Water examples

To see how to run examples for Sparkling Water, please see Running Examples.


Sparkling Water Backends

Sparkling Water supports two backend/deployment modes: internal and external. Sparkling Water applications are independent of the selected backend. The backend can be specified before the H2OContext is created.

For more details regarding the internal or external backend, please see Backends.


FAQ

A list of all frequently asked questions is available at FAQ.


Development

Complete development documentation is available at Development Documentation.

Build Sparkling Water

To see how to build Sparkling Water, please see Build Sparkling Water.

Develop applications with Sparkling Water

An application using Sparkling Water is a regular Spark application that bundles the Sparkling Water library. See Sparkling Water Droplet, which provides an example application, here.

Contributing

Look at our list of JIRA tasks for new contributors or send your idea to support@h2o.ai.

Filing Bug Reports and Feature Requests

You can file a bug report or feature request directly in the Sparkling Water JIRA page at http://jira.h2o.ai/.

  1. Log in to the Sparkling Water JIRA tracking system. (Create an account if necessary.)

  2. Once inside the home page, click the Create button.

  3. A form will display allowing you to enter information about the bug or feature request.

    Enter the following on the form:

    • Select the Project that you want to file the issue under. For example, if this is an open source public bug, you should file it under Sparkling Water (SW).
    • Specify the Issue Type. For example, if you believe you've found a bug, select Bug; if you want to request a new feature, select New Feature.
    • Provide a short, concise summary of the issue. The summary will be shown when engineers organize, filter, and search for Jira tickets.
    • Specify the urgency of the issue using the Priority dropdown menu.
    • If there is a due date, specify it in the Due Date field.
    • The Components dropdown refers to the API or language that the issue relates to. (See the dropdown menu for available options.)
    • You can leave the Affects Version/s, Fix Versions, and Assignee fields blank. Our engineering team will fill these in.
    • Add a detailed description of your bug in the Description section. Best practices for descriptions include:
        • A summary of what the issue is
        • What you think is causing the issue
        • Reproducible code that can be run end to end without requiring an engineer to edit your code. Use {code} {code} around your code to make it appear in code format.
        • Any scripts or necessary documents. Add them by dragging and dropping your files into the create issue dialog box.

    You can leave the rest of the ticket blank.

  4. When you are done with your ticket, simply click on the Create button at the bottom of the page.

After you click Create, a pop-up will appear on the right side of your screen with a link to your Jira ticket. It will have the form https://0xdata.atlassian.net/browse/SW-####. You can use this link to edit your ticket later.

Please note that your Jira ticket number and its summary will appear in one of the Jira ticket Slack channels. Whenever you update the ticket, anyone associated with it, whether as the assignee or a watcher, will receive an email with your changes.

Have Questions?

We also respond to questions tagged with the sparkling-water and h2o tags on Stack Overflow.

Change Logs

Change logs are available at Change Logs.