R + sparkling-water (H2O/Spark) #8
Yes! In fact, Sparkling Water provides H2O on top of a Spark cluster, so you can access all H2O services, including the R and Python interfaces. Look into the H2O-DEV project to see examples of R code: https://github.com/h2oai/h2o-dev/tree/master/h2o-r/tests Deployment into an EC2 environment depends on the Spark infrastructure: you need a running Spark cluster, and then you just submit Sparkling Water to it. I will leave this issue open and try to provide more examples involving cooperation with R. |
Where exactly should I be looking in the h2o tests to see how Spark is being used under the hood for h2o algos in R? |
Hi, H2O algos are written from scratch in H2O (see the h2o-dev GitHub repository). The examples and tests mostly use Spark for data selection and preprocessing (e.g. Spark SQL). Sparkling Water has both h2o-dev and Spark as Maven dependencies. Thanks,
|
Hi, I am interested to use h2o algorithms (like RF/GBM) to perform a classification task on a dataset loaded in Spark. Scala is on my "to-learn" list but at this point of time, I would like to use R. Is it possible to write R code that calls h2o algorithms on data in spark? If yes, if you guys can produce an example (like the ones you have provided using Scala) that would be immensely helpful! Thanks :-) |
I have the same question. The Sparkling Water FAQ says that this is possible, but that is not reflected in any of the examples. It would be great if you could clarify this for us. Thanks in advance :) |
@binga yes, you can connect from R to a running H2O/Sparkling Water cluster and run algos, analyze data, or do feature munging. See docs.h2o.ai for an R example. You can also look at some examples in Sparkling Water that incorporate R:
The main point is that, if you have a running H2O/Sparkling Water cluster, you can combine different clients to drive the computation: prepare data in Sparkling Water + Spark, build a model from R, and analyze the predicted results in the Flow UI. |
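As a rough illustration of the multi-client workflow described above, the R sketch below attaches to an H2O cluster and builds a model on a frame that is looked up by key, the way a frame published from the Spark side would be. The address, the frame key `prepared_data`, and the use of `iris` as stand-in data are assumptions made for illustration; against a real Sparkling Water cluster the frame would already exist and the `as.h2o()` line would be dropped.

```r
library(h2o)

# Assumption: replace ip/port with the address reported by your running cluster.
# With these defaults, h2o.init() starts a local instance if none is found.
h2o.init(ip = "localhost", port = 54321)

# Stand-in for a frame published from Spark; "prepared_data" is a hypothetical key.
as.h2o(iris, destination_frame = "prepared_data")

# Any client attached to the same cluster can list frames and fetch one by key.
print(h2o.ls())
prepared <- h2o.getFrame("prepared_data")

# Train a GBM inside the cluster; only the model handle and metrics return to R.
model <- h2o.gbm(x = setdiff(colnames(prepared), "Species"),
                 y = "Species",
                 training_frame = prepared,
                 ntrees = 20)
print(h2o.performance(model))
```

The data stays in the cluster throughout, so the same frame and model can then be inspected from the Flow UI at the same address.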
@phanisrinath please look at examples posted above, or come to our Sparkling Water meetups to see the interoperability |
Will there be any integration with sparkR? |
I am not sure it would be useful right now. Our R approach is totally different from SparkR: we are focused on being transparent for R users and distributing regular R operations in the backend. However, we can probably expose H2OContext primitives inside SparkR. |
Hi, right now we need to test Sparkling Water in an EMR environment. But since Sparkling Water is just a jar which needs to be passed via the Spark --jars option to
Michal
On 6/18/15 at 3:54 AM, hmaeda wrote:
|
Add me to the list. I have many R models based on R packages that I do not want to convert to h2o. I have yet to find a simple walkthrough to take current R models and port them through h2o. If it's not possible that's cool, but it seems like posts suggest it is but no one shows how it's done. |
@bigfantasyfootball can you provide more details? What kind of models do you have in R? |
Hi @mmalohlava, thanks for inquiring. If I could see how the ridiculously simple example below works using h2o, I could figure out the rest.
library(caret)
rf <- train(Species ~ ., data = iris, method = "rf") |
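For readers following along, a minimal sketch of what the H2O counterpart of that caret call might look like is given below. It retrains a random forest with H2O's own implementation rather than converting the fitted caret object, and it assumes the h2o R package is installed and that a local H2O instance can be started; the `ntrees` and `seed` values are arbitrary choices for illustration.

```r
library(h2o)

# Starts a local H2O instance, or connects to one already running on the default port.
h2o.init()

# Copy the R data.frame into the H2O cluster as an H2OFrame.
iris_hex <- as.h2o(iris)

# H2O's random forest; predictors and response are given by column name.
rf <- h2o.randomForest(x = c("Sepal.Length", "Sepal.Width",
                             "Petal.Length", "Petal.Width"),
                       y = "Species",
                       training_frame = iris_hex,
                       ntrees = 50,
                       seed = 1)

# Predictions come back as an H2OFrame with a "predict" column
# plus one probability column per class.
pred <- h2o.predict(rf, iris_hex)
print(head(pred))
```

Unlike caret's `train()`, this call does not resample or tune by default; cross-validation can be requested with the `nfolds` argument.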
Hiya, |
Hi Alex, can you elaborate more on your idea? It sounds interesting, but I am not sure what you mean by
Best regards,
On 9/29/15 4:10 AM, Alex Shires wrote:
|
Hi Michal, yes, I was confused! SparkR is an R package that imports the Spark library, methods, etc. into an R shell. As far as I understand it, Sparkling Water builds the H2O layer/interface on top of Spark, so if one adds the right H2O dependencies to the SBT files in SparkR, will it be able to link against both? I'll need to dig a bit further to look into creating R packages... Alex |
Hi Michal, as far as I understand it, in order to get Sparkling Water working within R, we'll need to write wrappers using the Spark<->R API for the functions we care about, or expose all the primitives for future developers. The other solution would be to write Scala functions that use Sparkling Water and then wrap them in the R API to expose a much simpler level of information transfer. Which of these sounds easier to you? I'm tempted to go for the second one: write a Scala package that depends on Sparkling Water and uses the machine learning I need from H2O, then write an R wrapper and try to include it so it gets built with SparkR. It's not a trivial problem... Regards, Alex |
Hi Alex, yes, you are right! And we are working on it (but right now for Python and PySpark). So let me explain how H2O's R support works: H2O exposes a REST API which exposes capabilities provided by
So if you import H2O's R package into your R session (
The trick is that if you run H2O on top of Spark (as Sparkling Water), you have access to the same
Does it make sense? Michal On 9/29/15 9:55 AM, Alex Shires wrote:
|
Hi Michal, is it possible to show a quick example of loading the iris data set from R, converting it to an RDD, and uploading it into h2o in Sparkling Water as a frame? I know that the h2o R package can do this (the as.h2o() function), but at the moment it writes the data to disk first before uploading the file. I am hoping that with Sparkling Water there is a way to avoid writing the data to disk first. (My actual data set is in RAM and is very large, so writing it to disk first is an expensive operation.) Furthermore, all of the examples I have seen focus on uploading files that are already on disk rather than data that is already in memory. An example that shows data moving from RAM (in R) to h2o would be appreciated, assuming that this is possible. Regards, Hiddi |
Hi Hiddi, so let's assume Spark with Sparkling Water is running - that means you created
import org.apache.spark.h2o._
val hc = new H2OContext(sc).start()
The H2O context gives you the address of the entry point to access from R:
So you can now use R to connect to the cluster and load data:
library(h2o)
h <- h2o.init(ip = "172.16.2.223", port = 54321)
iris.hex <- as.h2o(iris)
Now you should see in the Flow UI (open http://172.16.2.223:54321)
However, as you mention, it will write a file to disk. BUT you can use
btw: would Tachyon help you? We have Tachyon support disabled, but it should be easy to enable it again. |
Hi Michal, support for uploading in-memory data would be amazing, and I am also very interested in the concept of streaming data to h2o! I am not sure how to 'file a jira' as you suggest, but my use case is this: I regularly receive several packets of smallish datasets. These are loaded into R's memory so that many additional statistical features can be calculated, adding many columns to the original data. The result becomes very large, and writing it to disk before uploading it into h2o is a very expensive operation. Therefore I would like an in-memory process for uploading data from R's memory to h2o (ideally at memory speed...).
How would data from R be streamed into h2o? Would it need something like Kafka, perhaps via the rkafka package? Or would the data need to be in Spark first? Would the rscala package be useful for getting data into Scala, to convert it to a Spark DataFrame and send it to h2o?
Separately, just so that I understand the process correctly: how does data transfer from Spark to h2o in a Sparkling Water setup? Is it written to disk as well?
I had not heard of Tachyon until you mentioned it, but from a quick search the concept sounds amazing. I am not sure how to use it for uploading data from R to h2o, though. Some simple examples of how to use Tachyon with h2o would be much appreciated. Regards, Hiddi |
Hi Michal, I have been following this conversation and I'm interested in the Tachyon support. We are about to deploy a Tachyon + Spark cluster for a new project, and I'd love to trial H2O for the machine-learning components of the project. Is this currently possible? Using PySparkling Water, for example, could I do the following: hc = H2OContext(sc).start() Or am I missing something? With thanks, Will |
Hello Michal We would like to support the following use case:
Do we have any example script which would demonstrate the above? Michal suggests the following: "just stream data to h2o, not big deal." Do we have any code snippet which can demo this idea? Thanks very much |
Hello Michal, I am initiating Spark using SparkR on a Hadoop cluster (deploy-mode=yarn-client)
I want to run H2O on that SparkContext without going out from R environment. Can I run anything like below in R/SparkR
|
Please refer to the https://github.com/h2oai/rsparkling repo, where rsparkling is currently located, for more information.
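To give a feel for what driving Sparkling Water from R can look like with that package, here is a hedged sketch. It assumes the sparklyr, rsparkling, h2o, and dplyr packages are installed with mutually compatible versions, uses a local Spark master purely for illustration, and relies on the `as_h2o_frame()` helper from early rsparkling releases; the function names and required setup may differ in the version you install, so treat the repo README as the authority.

```r
library(sparklyr)
library(rsparkling)
library(h2o)
library(dplyr)

# Assumption: a local Spark master; on a cluster this could be "yarn-client" instead.
sc <- spark_connect(master = "local")

# Copy iris into Spark. sparklyr replaces dots in column names with underscores.
iris_tbl <- copy_to(sc, iris, "iris_spark", overwrite = TRUE)

# Convert the Spark DataFrame into an H2OFrame (helper from early rsparkling releases).
iris_hex <- as_h2o_frame(sc, iris_tbl)

# The response arrives as a string column; H2O needs a factor for classification.
iris_hex$Species <- as.factor(iris_hex$Species)

# Train an H2O model from R on data that originated in Spark.
model <- h2o.gbm(x = setdiff(colnames(iris_hex), "Species"),
                 y = "Species",
                 training_frame = iris_hex,
                 ntrees = 20)
print(model)

spark_disconnect(sc)
```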
All of the demos/examples in the README.md seem to do all the prediction code in Scala and only have R as an afterthought, using the residualPlotRCode function from here to visualise in R. Scala is on my "to learn" list, but in the meantime... Given H2O's close connections with R, is it possible to have a Sparkling Water example/demo with R as the interface? Ideally with an EC2 example too, to illustrate the benefits of distributed parallel computing? Maybe it might need to use SparkR, or something else? I'm not sure...