
R + sparkling-water (H2O/Spark) #8

Closed
hmaeda opened this issue May 5, 2015 · 26 comments

Comments

@hmaeda

hmaeda commented May 5, 2015

All of the demos/examples in the README.md seem to do all the prediction code in Scala, with R only as an afterthought, using the residualPlotRCode function from here to visualise in R. Scala is on my "to learn" list, but in the meantime... Given H2O's close connections with R, is it possible to see/have a Sparkling Water example/demo with R as the interface? And ideally with an EC2 example too, to illustrate the benefits of distributed parallel computing? Maybe it would need to use SparkR? Or something else? I'm not sure...

@mmalohlava
Member

Yes! In fact, Sparkling Water provides H2O on top of a Spark cluster, so you can access all H2O services - including the R and Python interfaces. Look into the H2O-DEV project to see examples of R code - https://github.com/h2oai/h2o-dev/tree/master/h2o-r/tests

The deployment into an EC2 environment depends on the Spark infrastructure. You need a Spark cluster to be running, and then you just submit Sparkling Water to the cluster.

I will leave this issue open and try to provide more examples involving cooperation with R.

@hmaeda
Author

hmaeda commented May 8, 2015

Where exactly should I be looking in the H2O tests to see how Spark is being used under the hood for the H2O algos in R?

@tomkraljevic
Contributor

Hi,

H2O algos are written from scratch in H2O (see the h2o-dev GitHub).
We are not currently using Spark to implement H2O algos.

The examples and tests mostly use Spark for data selection and preprocessing (e.g. Spark SQL).
Then the H2O algos are called.
You could also call MLlib algos if you wish.

Sparkling Water has both h2o-dev and Spark as Maven dependencies.

Thanks,
Tom


@binga

binga commented Jun 11, 2015

Hi,

I am interested in using h2o algorithms (like RF/GBM) to perform a classification task on a dataset loaded in Spark. Scala is on my "to-learn" list, but at this point in time I would like to use R. Is it possible to write R code that calls h2o algorithms on data in Spark? If yes, it would be immensely helpful if you could provide an example (like the ones you have provided using Scala)!

Thanks :-)

@phanisrinath

I seem to have the same doubt. The Sparkling Water FAQ says that it is possible, but that doesn't seem to be reflected in any of the examples. It would be great if you could clarify the above query for us.

Thanks in Advance :)

@mmalohlava
Member

@binga yes, you can connect from R to a running H2O/Sparkling Water cluster and run algos, analyze data, or do feature munging. See docs.h2o.ai for an R example.

You can also look at some examples in Sparkling Water incorporating R.

The main point is that, if you have H2O/Sparkling Water running, you can combine different clients to drive the computation - prepare data in Sparkling Water + Spark, build a model from R, and analyze the predicted results from the Flow UI.
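As a minimal sketch of that workflow from the R side (hedged: the IP/port and the frame name "train.hex" are placeholders, not a real deployment; it assumes a Sparkling Water cluster is already running and has published a frame, and uses the documented h2o R functions h2o.getFrame and h2o.gbm):

```r
# Attach the R client to an already-running H2O/Sparkling Water cluster.
# The IP and port below are placeholders for your own cluster's address.
library(h2o)
h2o.init(ip = "10.0.0.1", port = 54321)

# Assume the Spark side has already published a frame named "train.hex";
# grab a handle to it and build a GBM on it from R.
train <- h2o.getFrame("train.hex")
model <- h2o.gbm(x = 1:4, y = 5, training_frame = train)
```

The same model and frame then show up in the Flow UI, since all clients talk to the one cluster.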

@mmalohlava
Member

@phanisrinath please look at the examples posted above, or come to our Sparkling Water meetups to see the interoperability.

@hmaeda
Author

hmaeda commented Jun 12, 2015

Will there be any integration with SparkR?

@mmalohlava
Member

I am not sure it would be useful right now; our R approach is totally different from SparkR's. We are focused on being transparent for R users and on distributing regular R operations in the backend.

However, we can probably expose the H2OContext primitives inside SparkR.

@hmaeda
Author

hmaeda commented Jun 18, 2015

Now that Spark is supported by EMR, as suggested here and here, would it be possible to request a tutorial on how to get R, H2O (Sparkling Water), and Spark working with EMR with the given AMIs? Or are there other AMIs that you would recommend?

@mmalohlava
Member

Hi,

Right now we need to test Sparkling Water in the EMR environment.

But since Sparkling Water is just a jar which needs to be passed via Spark's --jars option to
spark-submit, I expect that the integration will be easy.
To use R with it, you just need to point R's h2o.init to the IP/port of the cluster.

michal


@nsharkey

Add me to the list. I have many R models based on R packages that I do not want to convert to h2o. I have yet to find a simple walkthrough that takes current R models and ports them through h2o.

If it's not possible, that's cool, but posts seem to suggest it is - yet no one shows how it's done.

@mmalohlava
Member

@bigfantasyfootball can you provide more details? What kind of models do you have in R?

@nsharkey

Hi @mmalohlava, thanks for inquiring. If I could see how the ridiculously simple example below works using h2o, I could figure out the rest.

library(caret)
data(iris)

rf <- train(Species ~ ., data = iris, method = "rf")
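For comparison, a rough h2o counterpart of the caret call above might look like the following (a sketch, hedged: this trains a new random forest inside the H2O cluster rather than porting the existing caret/randomForest model - H2O models are rebuilt cluster-side, not imported - and it assumes a local H2O instance started from R):

```r
library(h2o)
h2o.init()               # start (or connect to) a local H2O instance

# Push iris into the H2O cluster, then train a random forest there.
iris.hex <- as.h2o(iris)
rf <- h2o.randomForest(x = 1:4, y = "Species", training_frame = iris.hex)
```

Note the difference in approach: caret wraps an in-process R model, while h2o.randomForest dispatches the training job to the cluster and returns a handle to the remote model.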

@alexshires

Hiya,
I'm also interested in integrating these technologies - can you simply point Sparkling Water at ./bin/sparkR instead of ./bin/spark? Is there a compilation option to change the executable?
Alex

@mmalohlava
Member

Hi Alex,

can you elaborate more on your idea? It sounds interesting, but I am not sure what you mean by
changing /bin/spark to /bin/sparkR.
Just to clarify: Sparkling Water is built on top of Spark as an application, so it depends on Spark
infrastructure and utilities (spark-submit, spark-shell).

Best regards,
Michal


@alexshires

Hi Michal,

Yes, I was confused! SparkR is an R package that imports the Spark library, methods, etc. into an R shell. As far as I understand it, Sparkling Water builds the H2O layer/interface on top of Spark - so if one adds the right dependencies for H2O to the SBT files in SparkR, will it be able to link against both?

I'll need to dig a bit further to look into creating R packages....

Alex

@alexshires

Hi Michal,

As far as I understand it, in order to get Sparkling Water working within R, we'll need to write wrappers using the Spark<->R API for the functions we care about - or expose all the primitives for future developers. The other solution would be to write Scala functions that include Sparkling Water and then wrap them in the R API, exposing a much simpler level of information transfer.

Which of these sounds easier to you? I'm tempted to go for the second one - write a Scala package that depends on Sparkling Water, using the machine learning I need from H2O, and then write an R wrapper and try to include it so it gets built with SparkR. It's not a trivial problem...

Regards,

Alex

@mmalohlava
Member

Hi Alex,

yes, you are right! And we are working on it (but right now for Python and pySpark).

So let me explain how H2O's R support works - H2O exposes a REST API which exposes the capabilities provided by the Java API.
On top of the REST API, we built several clients, including R, Python, and the Flow UI.

So if you import H2O's R package into your R session (library(h2o) - it is available via CRAN),
we override a few functions; but to make it work, you have to connect your R client to an existing H2O
cluster via client <- h2o.init(ip=..., port=...).
From this point you can use all capabilities provided by the h2o package
(http://h2o-release.s3.amazonaws.com/h2o/rel-slater/5/docs-website/h2o-r/h2o_package.pdf).

The trick is that if you run H2O on top of Spark (as Sparkling Water), you have access to the same
REST API.
So you can connect to it from your local machine, and you can even connect to it from
SparkR code (in theory - it is one of our goals for the next months).

Does it make sense?

Does it make sense?

Michal


@hmaeda
Author

hmaeda commented Oct 7, 2015

Hi Michal,

Is it possible to show a quick example of loading the iris data set from R, converting it to an RDD, and uploading it into h2o in Sparkling Water as a frame? I know that the h2o R package can do this (the as.h2o() function), but at the moment it is written to write to disk first before uploading the file. I am hoping that with Sparkling Water there will be a process by which the data does not get written to disk first. (My actual data set is in RAM and is very large, and writing to disk first is an expensive operation.)

Furthermore, all of the examples I have seen seem to focus on uploading files that are already on disk, not on data that is already in memory/RAM. An example that shows data moving from RAM (in R) to h2o would be appreciated, assuming that this is possible.

Regards,

Hiddi

@mmalohlava
Member

Hi Hiddi,

so let's assume Spark with Sparkling Water is running - that means you created an H2OContext and started it (in the Spark shell or in a standalone application) with code like this:

import org.apache.spark.h2o._
val hc = new H2OContext(sc).start()

The H2OContext gives you the address of the entry point to access from R:

h2oContext: org.apache.spark.h2o.H2OContext =                                   

Sparkling Water Context:
 * number of executors: 3
 * list of used executors:
  (executorId, host, port)
  ------------------------
  (0,michals-mbp.0xdata.loc,54327)
  (1,michals-mbp.0xdata.loc,54331)
  (2,michals-mbp.0xdata.loc,54323)
  ------------------------

  Open H2O Flow in browser: http://172.16.2.223:54321 (CMD + click in Mac OSX)

So you can now use R to connect to a cluster and load data:

library(h2o)

h = h2o.init(ip="172.16.2.223", port=54321)
iris.hex <- as.h2o(iris)

Now you should see the iris.hex dataset in the Flow UI (open http://172.16.2.223:54321).

However, as you mention, it will write the file to disk. But you can use h2o.uploadFile, or h2o.importFile from HDFS, if that is handier. Right now we do not have support for uploading in-memory data, but technically I can imagine a solution - just stream the data to H2O; not a big deal.
If you are interested, please file a JIRA for your use case (http://jira.h2o.ai).

By the way: would Tachyon help you? We have Tachyon support disabled, but it should be easy to enable it again.
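The two alternatives Michal mentions (h2o.uploadFile / h2o.importFile) can be sketched in R like this (hedged: the HDFS and local paths are placeholders; the cluster address reuses the one from the example above):

```r
library(h2o)
h2o.init(ip = "172.16.2.223", port = 54321)

# h2o.importFile: the H2O cluster reads the file itself (e.g. from HDFS),
# so the data never passes through the R session at all.
big.hex <- h2o.importFile("hdfs://namenode/path/to/data.csv")

# h2o.uploadFile: streams a file from the R client's machine to the cluster.
small.hex <- h2o.uploadFile("/local/path/data.csv")
```

Neither call avoids disk entirely - the data must already exist as a file somewhere - which is why the in-memory streaming path discussed below would be a new feature.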

@hmaeda
Author

hmaeda commented Oct 7, 2015

Hi Michal,

Support for uploading in-memory data would be amazing, and I am also very interested in the concept of streaming data to h2o! I'm not too sure how to 'file a jira' as you suggest, but my use case is this: I regularly receive several packets of smallish datasets, which are loaded into R's memory/environment so that lots of additional statistical features can be calculated, adding many columns to the original data. The data then becomes very large, and writing it to disk before uploading into h2o's environment is a very expensive operation. Therefore I would like an in-memory process for uploading that data from R's memory to h2o's environment. (Ideally at memory speed...)

How would data from R be streamed into h2o? Would it need something like Kafka, perhaps via the rkafka package? Or would the data need to be in Spark first?

Also, would the rscala package be useful for getting data into Scala, to convert to a Spark DataFrame to be sent to h2o?

Separately, just so that I understand how the process works: how does data transfer from Spark to h2o in a Sparkling Water setup? Is it written to disk as well?

I had not heard of Tachyon until you mentioned it, but from a quick Google, the concept sounds amazing. I'm not too sure how to use it for uploading data from R to h2o, though. Some simple examples of how to use Tachyon with h2o would be much appreciated.

Regards,

Hiddi

@Will-Hardman

Hi Michal,

I have been following this conversation and I'm interested in the Tachyon support. We are about to deploy a Tachyon + Spark cluster for a new project, and I'd love to trial H2O for the machine-learning components of the project.

Is this currently possible? Using PySparkling Water, for example, could I do the following:

hc = H2OContext(sc).start()
hc.textFile("tachyon://path_to_file")

Or am I missing something?

With thanks,

Will

@ta269uec

Hello Michal

We would like to support the following use case:

  • Create/load a Spark data frame within R (using SparkR).
  • Modify this data frame. Then we want to apply h2o machine learning methods to it, so we would like to upload this data frame from Spark to h2o. We do not wish to use disk as an intermediary for this purpose.

Do we have any example script which would demonstrate the above? Michal suggests "just stream data to h2o, not big deal" - do we have any code snippet which can demo this idea?

Thanks very much

@manubatham20

manubatham20 commented Jul 29, 2016

Hello Michal,

I am initiating Spark using SparkR on a Hadoop cluster (deploy-mode=yarn-client):

Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/cloudera-prod/conf.cloudera.yarn")
Sys.setenv(SPARK_HOME = "/home/softs/spark-1.6.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="yarn-client")
# pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
# for (pkg in pkgs) {
#   if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
# }
# 
# # Now we download, install and initialize the H2O package for R.
# install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turin/4/R")))
library(h2o)

I want to run H2O on that SparkContext without leaving the R environment. Can I run anything like the following in R/SparkR?

hc = H2OContext(sc).start()
Thanks,
Manu

@jangorecki jangorecki changed the title R + sparkling-water (H20/Spark) R + sparkling-water (H2O/Spark) Sep 30, 2016
@jakubhava
Contributor

Please refer to the https://github.com/h2oai/rsparkling repo, where rsparkling is currently located, for more information.
