If you have connected to H2O from RStudio before, the process for connecting to Sparkling Water from RStudio is very similar.
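Before starting, verify that R, RStudio, and Sparkling Water are installed.
Then, from the Sparkling Water directory, set the Spark environment variables and start Sparkling Shell: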
export SPARK_HOME="/path/to/spark/installation"
export MASTER="local-cluster[3,2,1024]"
bin/sparkling-shell
To view the Sparkling Shell status, go to http://localhost:4040/.
In Sparkling Shell, create and start the H2O context:
import org.apache.spark.h2o._
val h2oContext = new H2OContext(sc).start()
import h2oContext._
The last line of the output (appearing above the scala> command prompt in the screenshot above) identifies the IP address and port number of the H2O cluster. Copy these values to use in the next step.
In RStudio, pass the IP address and port number from the previous step's output to the h2o.init() call:
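The call might look like the following minimal sketch. The IP address and port shown here are placeholders, not values from this guide; substitute the ones printed by Sparkling Shell. Setting startH2O = FALSE is an optional choice that keeps R from launching a separate local H2O instance if the connection fails.

```r
library(h2o)

# Placeholder values: replace with the IP and port printed by Sparkling Shell.
h2o_ip   <- "127.0.0.1"
h2o_port <- 54321

# Connect to the already-running H2O cluster.
# startH2O = FALSE prevents R from starting a new local H2O instance instead.
h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE)
```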
The Spark DataFrame can then be published as an H2OFrame and accessed in R.
In Sparkling Shell:
val df = sc.parallelize(1 to 100).toDF // creates Spark DataFrame
val hf = h2oContext.asH2OFrame(df) // publishes DataFrame as H2O's Frame
In the output in the screenshot above, the second line below the highlighted line displays the name of the published frame (frame_rdd_6).
View all frames available in RStudio using h2o.ls():
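As a rough sketch, listing the frames from R could look like this. The connection address is a placeholder assumption; use the IP and port reported by Sparkling Shell, or drop the h2o.init() line if the session is already connected.

```r
library(h2o)

# Placeholder address: replace with the IP and port printed by Sparkling Shell.
h2o.init(ip = "127.0.0.1", port = 54321, startH2O = FALSE)

# List the keys of all objects currently stored in the H2O cluster.
frames <- h2o.ls()
print(frames)
```

The published frame should appear in this list under the name reported by Sparkling Shell.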
The frame can now be used in RStudio (for example, as shown in the screenshot below, using h2o.getFrame).
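A minimal sketch of retrieving and inspecting the frame follows. It assumes the frame was published as frame_rdd_6, as in the example above; the name in your session may differ, so check the h2o.ls() output first. The connection address is again a placeholder.

```r
library(h2o)

# Placeholder address: replace with the IP and port printed by Sparkling Shell.
h2o.init(ip = "127.0.0.1", port = 54321, startH2O = FALSE)

# Get a reference to the frame published from Sparkling Shell.
# "frame_rdd_6" is the name from the example; yours may differ.
hf <- h2o.getFrame("frame_rdd_6")

# Inspect the frame without copying all of its data into R.
dim(hf)      # number of rows and columns
head(hf)     # first few rows
summary(hf)  # per-column summary statistics
```

The object returned by h2o.getFrame is a reference to data held in the H2O cluster, so these calls are evaluated on the cluster rather than on a local copy.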