
R + sparkling-water (H2O/Spark) #8

Closed
hmaeda opened this issue May 5, 2015 · 26 comments

Comments

@hmaeda

hmaeda commented May 5, 2015

All of the demos/examples in the README.md seem to do all the prediction code in Scala, with R only as an afterthought, using the residualPlotRCode function from here to visualise in R. Scala is on my "to learn" list, but in the meantime... Given H2O's close connections with R, is it possible to see/have a Sparkling Water example/demo with R as the interface? And ideally with an EC2 example too, to illustrate the benefits of distributed parallel computing? Maybe it would need to use SparkR? Or something else? I'm not sure...

@mmalohlava
Member

Yes! In fact, Sparkling Water provides H2O on top of a Spark cluster, so you can access all H2O services - including the R and Python interfaces. Look into the H2O-DEV project to see examples of R code - https://github.com/h2oai/h2o-dev/tree/master/h2o-r/tests

The deployment into an EC2 environment depends on the Spark infrastructure. You need a Spark cluster to be running, and then you just submit Sparkling Water to the cluster.

I will leave this issue open and try to provide more examples involving cooperation with R.

@hmaeda
Author

hmaeda commented May 8, 2015

Where exactly should I be looking in the H2O tests to see how Spark is being used under the hood for the H2O algos in R?

@tomkraljevic
Contributor

Hi,

H2O algos are written from scratch in H2O (see the h2o-dev GitHub).
We are not currently using Spark to implement H2O algos.

The examples and tests mostly use Spark for data selection and preprocessing (e.g. Spark SQL).
Then the H2O algos are called.
You could also call MLlib algos if you wish.

Sparkling Water has both h2o-dev and Spark as Maven dependencies.

Thanks,
Tom


@binga

binga commented Jun 11, 2015

Hi,

I am interested in using h2o algorithms (like RF/GBM) to perform a classification task on a dataset loaded in Spark. Scala is on my "to-learn" list, but at this point in time I would like to use R. Is it possible to write R code that calls h2o algorithms on data in Spark? If yes, it would be immensely helpful if you could provide an example (like the ones you have provided using Scala)!

Thanks :-)

@phanisrinath

I seem to have the same doubt. The Sparkling Water FAQ says that it is possible, but that doesn't seem to be reflected in any of the examples. It would be great if you could clarify the above query for us.

Thanks in Advance :)

@mmalohlava
Member

@binga yes, you can connect from R to a running H2O/Sparkling Water cluster and run algos, analyze data, or do feature munging. See docs.h2o.ai for an R example.

You can also look at some examples in Sparkling Water incorporating R.

The main point is that, if you have H2O/Sparkling Water running, you can combine different clients to drive the computation - prepare data in Sparkling Water + Spark, build a model from R, and analyze the predicted results from the Flow UI.
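As a minimal sketch of that workflow from the R side (hedged: the IP/port and the frame name "train.hex" are placeholders, not a real deployment; it assumes a Sparkling Water cluster is already running and has published a frame, and uses the documented h2o R functions h2o.getFrame and h2o.gbm):

```r
# Attach the R client to an already-running H2O/Sparkling Water cluster.
# The IP and port below are placeholders for your own cluster's address.
library(h2o)
h2o.init(ip = "10.0.0.1", port = 54321)

# Assume the Spark side has already published a frame named "train.hex";
# grab a handle to it and build a GBM on it from R.
train <- h2o.getFrame("train.hex")
model <- h2o.gbm(x = 1:4, y = 5, training_frame = train)
```

The same model and frame then show up in the Flow UI, since all clients talk to the one cluster.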

@mmalohlava
Member

@phanisrinath please look at the examples posted above, or come to our Sparkling Water meetups to see the interoperability.

@hmaeda
Author

hmaeda commented Jun 12, 2015

Will there be any integration with SparkR?

@mmalohlava
Member

I am not sure it would be useful right now; our R approach is totally different from SparkR's. We are focused on being transparent for R users and on distributing regular R operations in the backend.

However, we can probably expose the H2OContext primitives inside SparkR.

@hmaeda
Author

hmaeda commented Jun 18, 2015

Now that Spark is supported by EMR, as suggested here and here, would it be possible to request a tutorial on how to get R, H2O (Sparkling Water), and Spark working with EMR with the given AMIs? Or are there other AMIs that you would recommend?

@mmalohlava
Member

Hi,

Right now we need to test Sparkling Water in the EMR environment.

But since Sparkling Water is just a jar which needs to be passed via Spark's --jars option to
spark-submit, I expect that the integration will be easy.
To use R with it, you just need to point R's h2o.init to the IP/port of the cluster.

michal


@nsharkey

Add me to the list. I have many R models based on R packages that I do not want to convert to h2o. I have yet to find a simple walkthrough that takes current R models and ports them through h2o.

If it's not possible, that's cool, but posts seem to suggest it is - yet no one shows how it's done.

@mmalohlava
Member

@bigfantasyfootball can you provide more details? What kind of models do you have in R?

@nsharkey

Hi @mmalohlava, thanks for inquiring. If I could see how the ridiculously simple example below works using h2o, I could figure out the rest.

library(caret)
data(iris)

rf <- train(Species ~ ., data = iris, method = "rf")
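For comparison, a rough h2o counterpart of the caret call above might look like the following (a sketch, hedged: this trains a new random forest inside the H2O cluster rather than porting the existing caret/randomForest model - H2O models are rebuilt cluster-side, not imported - and it assumes a local H2O instance started from R):

```r
library(h2o)
h2o.init()               # start (or connect to) a local H2O instance

# Push iris into the H2O cluster, then train a random forest there.
iris.hex <- as.h2o(iris)
rf <- h2o.randomForest(x = 1:4, y = "Species", training_frame = iris.hex)
```

Note the difference in approach: caret wraps an in-process R model, while h2o.randomForest dispatches the training job to the cluster and returns a handle to the remote model.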

@alexshires

Hiya,
I'm also interested in integrating these technologies - can you simply point Sparkling Water at ./bin/sparkR instead of ./bin/spark? Is there a compilation option to change the executable?
Alex

@mmalohlava
Member

Hi Alex,

can you elaborate more on your idea? It sounds interesting, but I am not sure what you mean by
changing /bin/spark to /bin/sparkR.
Just to clarify: Sparkling Water is built on top of Spark as an application, so it depends on Spark
infrastructure and utilities (spark-submit, spark-shell).

Best regards,
Michal


@alexshires

Hi Michal,

Yes, I was confused! SparkR is an R package that imports the Spark library, methods, etc. into an R shell. As far as I understand it, Sparkling Water builds the H2O layer/interface on top of Spark - so if one adds the right dependencies for H2O to the SBT files in SparkR, will it be able to link against both?

I'll need to dig a bit further to look into creating R packages....

Alex

@alexshires

Hi Michal,

As far as I understand it, in order to get Sparkling Water working within R, we'll need to write wrappers using the Spark<->R API for the functions we care about - or expose all the primitives for future developers. The other solution would be to write Scala functions that include Sparkling Water and then wrap them in the R API, exposing a much simpler level of information transfer.

Which of these sounds easier to you? I'm tempted to go for the second one - write a Scala package that depends on Sparkling Water, using the machine learning I need from H2O, and then write an R wrapper and try to include it so it gets built with SparkR. It's not a trivial problem...

Regards,

Alex

@mmalohlava
Member

Hi Alex,

yes, you are right! And we are working on it (but right now for Python and pySpark).

So let me explain how H2O's R support works - H2O exposes a REST API which exposes the capabilities provided by the Java API.
On top of the REST API, we built several clients, including R, Python, and the Flow UI.

So if you import H2O's R package into your R session (library(h2o) - it is available via CRAN),
we override a few functions; but to make it work, you have to connect your R client to an existing H2O
cluster via client <- h2o.init(ip=..., port=...).
From this point you can use all capabilities provided by the h2o package
(http://h2o-release.s3.amazonaws.com/h2o/rel-slater/5/docs-website/h2o-r/h2o_package.pdf).

The trick is that if you run H2O on top of Spark (as Sparkling Water), you have access to the same
REST API.
So you can connect to it from your local machine, and you can even connect to it from
SparkR code (in theory - it is one of our goals for the next months).

Does it make sense?

Does it make sense?

Michal


@hmaeda
Author

hmaeda commented Oct 7, 2015

Hi Michal,

Is it possible to show a quick example of loading the iris data set from R, converting it to an RDD, and uploading it into h2o in Sparkling Water as a frame? I know that the h2o R package can do this (the as.h2o() function), but at the moment it is written to write to disk first before uploading the file. I am hoping that with Sparkling Water there will be a process by which the data does not get written to disk first. (My actual data set is in RAM and is very large, and writing to disk first is an expensive operation.)

Furthermore, all of the examples I have seen seem to focus on uploading files that are already on disk, not on data that is already in memory/RAM. An example that shows data moving from RAM (in R) to h2o would be appreciated, assuming that this is possible.

Regards,

Hiddi

@mmalohlava
Member

Hi Hiddi,

so let's assume Spark with Sparkling Water is running - that means you created an H2OContext and started it (in the Spark shell or in a standalone application) with code like this:

import org.apache.spark.h2o._
val hc = new H2OContext(sc).start()

The H2OContext gives you the address of the entry point to access from R:

h2oContext: org.apache.spark.h2o.H2OContext =                                   

Sparkling Water Context:
 * number of executors: 3
 * list of used executors:
  (executorId, host, port)
  ------------------------
  (0,michals-mbp.0xdata.loc,54327)
  (1,michals-mbp.0xdata.loc,54331)
  (2,michals-mbp.0xdata.loc,54323)
  ------------------------

  Open H2O Flow in browser: http://172.16.2.223:54321 (CMD + click in Mac OSX)

So you can now use R to connect to a cluster and load data:

library(h2o)

h = h2o.init(ip="172.16.2.223", port=54321)
iris.hex <- as.h2o(iris)

Now you should see the iris.hex dataset in the Flow UI (open http://172.16.2.223:54321).

However, as you mention, it will write the file to disk. But you can use h2o.uploadFile, or h2o.importFile from HDFS, if that is handier. Right now we do not have support for uploading in-memory data, but technically I can imagine a solution - just stream the data to H2O; not a big deal.
If you are interested, please file a JIRA for your use case (http://jira.h2o.ai).

By the way: would Tachyon help you? We have Tachyon support disabled, but it should be easy to enable it again.
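The two alternatives Michal mentions (h2o.uploadFile / h2o.importFile) can be sketched in R like this (hedged: the HDFS and local paths are placeholders; the cluster address reuses the one from the example above):

```r
library(h2o)
h2o.init(ip = "172.16.2.223", port = 54321)

# h2o.importFile: the H2O cluster reads the file itself (e.g. from HDFS),
# so the data never passes through the R session at all.
big.hex <- h2o.importFile("hdfs://namenode/path/to/data.csv")

# h2o.uploadFile: streams a file from the R client's machine to the cluster.
small.hex <- h2o.uploadFile("/local/path/data.csv")
```

Neither call avoids disk entirely - the data must already exist as a file somewhere - which is why the in-memory streaming path discussed below would be a new feature.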

@hmaeda
Author

hmaeda commented Oct 7, 2015

Hi Michal,

Support for uploading in-memory data would be amazing, and I am also very interested in the concept of streaming data to h2o! I'm not too sure how to 'file a jira' as you suggest, but my use case is this: I regularly receive several packets of smallish datasets, which are loaded into R's memory/environment so that lots of additional statistical features can be calculated, adding many columns to the original data. The data then becomes very large, and writing it to disk before uploading into h2o's environment is a very expensive operation. Therefore I would like an in-memory process for uploading that data from R's memory to h2o's environment. (Ideally at memory speed...)

How would data from R be streamed into h2o? Would it need something like Kafka, perhaps via the rkafka package? Or would the data need to be in Spark first?

Also, would the rscala package be useful for getting data into Scala, to convert to a Spark DataFrame to be sent to h2o?

Separately, just so that I understand how the process works: how does data transfer from Spark to h2o in a Sparkling Water setup? Is it written to disk as well?

I had not heard of Tachyon until you mentioned it, but from a quick Google, the concept sounds amazing. I'm not too sure how to use it for uploading data from R to h2o, though. Some simple examples of how to use Tachyon with h2o would be much appreciated.

Regards,

Hiddi

@Will-Hardman

Hi Michal,

I have been following this conversation and I'm interested in the Tachyon support. We are about to deploy a Tachyon + Spark cluster for a new project, and I'd love to trial H2O for the machine-learning components of the project.

Is this currently possible? Using PySparkling Water, for example, could I do the following:

hc = H2OContext(sc).start()
hc.textFile("tachyon://path_to_file")

Or am I missing something?

With thanks,

Will

@ta269uec

Hello Michal

We would like to support the following use case:

  • Create/load a Spark data frame within R (using SparkR).
  • Modify this data frame. Then we want to apply h2o machine learning methods to it, so we would like to upload this data frame from Spark to h2o. We do not wish to use disk as an intermediary for this purpose.

Do we have any example script which would demonstrate the above? Michal suggests "just stream data to h2o, not big deal" - do we have any code snippet which can demo this idea?

Thanks very much

@manubatham20

manubatham20 commented Jul 29, 2016

Hello Michal,

I am initiating Spark using SparkR on a Hadoop cluster (deploy-mode=yarn-client):

Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/cloudera-prod/conf.cloudera.yarn")
Sys.setenv(SPARK_HOME = "/home/softs/spark-1.6.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="yarn-client")
# pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
# for (pkg in pkgs) {
#   if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
# }
# 
# # Now we download, install and initialize the H2O package for R.
# install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turin/4/R")))
library(h2o)

I want to run H2O on that SparkContext without leaving the R environment. Can I run anything like the following in R/SparkR?

hc = H2OContext(sc).start()
Thanks,
Manu

@jangorecki jangorecki changed the title R + sparkling-water (H20/Spark) R + sparkling-water (H2O/Spark) Sep 30, 2016
@jakubhava
Contributor

Please refer to the https://github.com/h2oai/rsparkling repo, where rsparkling is currently located, for more information.
