sparkmagic proposal #2
Conversation
sparkmagic proposal
Super exciting!

Scope:

* Spark will be the first back end provided by these magics, but it could easily be extended to other big data back ends.
* Remote Spark execution through a REST endpoint which allows for Scala, Python, and R support as of September 2015.
What's the deadline for?
It's not a deadline. It's just stating the set of languages currently supported by Spark.
Ah, now I gotcha
Fixing typo.
Alternatives we know of:

* Combination of IPython, R kernel (or rpy2), and Scala kernel for an in-Spark-cluster Jupyter installation. This does not allow the user to point to different Spark clusters. It might also result in resource contention (CPU or memory) between the Jupyter installation and Spark.
* IBM's Spark kernel does not provide a REST endpoint, requires the installation of Jupyter in the cluster, and does not create pandas dataframes.
These sound like all negatives! What are lessons to be learned from what the spark kernel currently has? Would that team want to get involved on this proposal?
Also it would be nice to have links to all these projects...
As I have learned more about the way that Spark is typically used, it makes less and less sense to me. Having to run pyspark, sparkr, etc. directly on the Spark cluster breaks so many of the good lessons of the modern web and distributed architectures. I think talking to Spark over REST makes tons of sense and is a huge step forward. It nicely separates the usage of Spark (just HTTP requests!) from its installation and deployment. It also enables a much more flexible set of ways of integrating with Jupyter, as kernels no longer have to be run directly on the Spark nodes. I am hoping to get feedback from the folks who built the Spark kernel though, to see how this vision plays with what they are thinking. @vinomaster @parente
IBM's Spark kernel does not provide a REST endpoint, requires the installation of Jupyter in the cluster, and does not create pandas dataframes.
Personally, I've always treated the "IBM Spark Kernel" as a Scala kernel for Jupyter that has all the pieces needed to talk to Spark from Scala, and gives you SparkContext by default when you run it. No more, no less. It's not really all that different from the other Jupyter language kernels IMHO.
I think talking to spark over REST makes tons of sense and is a huge step forward.
It is a step forward from having to run the entire Jupyter notebook server plus kernels on the same L3 network as the Spark workers. But keep in mind a pure REST API will likely not work well with Spark Streaming which is one of the huge draws of Spark.
It also enables a much more flexible set of ways for integrating with Jupyter as kernels no longer have to be run directly on the spark nodes.
Again, while a REST API is a nice simple first step toward remote access to a Spark cluster, there are potential benefits to keeping the kernels on / near the compute cluster but running the Notebook web server app remotely, namely:
- Compute driver (kernel) stays close to compute workers (Spark) stays close to (big) data
- You get streaming "for free" thanks to the Jupyter protocol and implementation (0mq + websockets)
- Enabling kernels to run remote from the Notebook server has other benefits beyond remote access to Spark
The last is the impetus for the potential Kernel Provisioner and Gateway API proposal. I've spent most of my time so far documenting potential use cases for it, the third of which on that page could cover remote access to a Spark (or any) compute cluster. I'm still figuring out how to bring it forward (JEP? incubator?) and consulting with @rgbkrk, @freeman-lab, et al. about how it fits with other efforts under way. That said, I think it will take some time to realize, so please don't see it as discouraging this magics + REST API proposal. I don't believe there needs to be one and only one way to get Jupyter to work with Spark.
/cc @lbustelo
@ellisonbg @parente One thing maybe to add here... for the vanilla approach, item number 1, it's typical for the "driver", which constructs the execution graph and submits it to the Spark master, to be on the same machine as the "master", which is part of the cluster -- but this isn't necessary.
The driver can be on a different machine so long as it's addressable from the workers (see here)
In our own deployments, we used to run driver and master on the same machine, but due to resource conflict concerns, we started running the driver on a different machine. This solves the resource issue, and otherwise works the same -- just start up a Jupyter notebook and create a SparkContext. And we regularly use this to talk to multiple clusters.
I generally agree with @ellisonbg that there's a lot to like about the RESTful model, but just wanted to add this to the discussion!
This is the last thing I would say tonight because I need to pay attention to the football game... Not sure it's fair to put the pandas data frame statement on the Spark kernel. First, it is a Scala kernel, so of course it does not have a pandas data frame, but the same can be said about all the other Jupyter kernels that are not Python based. Let's remember that this is not IPython, it is Jupyter.
Awesome discussion! Thanks so much for your contributions.
Let me first try to address how Spark works to get to a shared understanding. If I've misunderstood something, please let me know.
The spark driver is the program that holds the spark context and creates the metadata around the RDDs, which are then evaluated by worker nodes. This driver DOES need network access to nodes in the cluster.
So, as long as you have an executable that has network access to the master in the cluster, for the drivers it creates, and can ask the master to do work, you'll be able to use Spark. This executable could be spark-submit, the pyspark shell, Livy, an IPython kernel, a Scala kernel, IBM's Spark kernel...
The architecture we are proposing for these magics is one in which the magics talk to a REST endpoint (we are thinking Livy) that can create different drivers for the user and fetch results for the user. You send a string of [Python, Scala, R] code that Livy relays to the Spark driver it created, and you start getting the results back from Livy over HTTP. You could certainly choose to have a remote kernel that creates your drivers for you and use 0mq, but it's our belief that using a REST endpoint would extend better to apps other than Jupyter kernels.
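To make the flow above concrete, here is a minimal sketch using only the standard library. The endpoint paths follow Livy's sessions/statements API as it existed in the Hue spark app; the server URL and the exact payload/response shapes are assumptions for illustration, not the proposal's final client API:

```python
import json
from urllib import request

# Assumed Livy endpoint; replace with your deployment's URL.
LIVY_URL = "http://livy-server:8998"

def session_payload(kind):
    """Body for POST /sessions; 'kind' picks the driver language
    ('spark' for Scala, 'pyspark' for Python, 'sparkr' for R)."""
    return {"kind": kind}

def statement_payload(code):
    """Body for POST /sessions/{id}/statements: just a string of code."""
    return {"code": code}

def post_json(url, payload):
    """POST a JSON payload and decode the JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

def demo():
    """End-to-end flow against a live Livy server (not run here):
    create a pyspark session, then submit one statement to it."""
    session = post_json(LIVY_URL + "/sessions", session_payload("pyspark"))
    return post_json("%s/sessions/%s/statements" % (LIVY_URL, session["id"]),
                     statement_payload("1 + 1"))
```

The point of the sketch is that the client side is plain HTTP plus JSON: anything that can POST a string of code can drive the cluster, which is why the same endpoint could serve apps other than Jupyter kernels.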
The benefits of having a remote installation of Jupyter that is able to connect to different clusters by virtue of changing the URL endpoint are manifold:
- Your notebooks are valuable by themselves, without the need to have a cluster running. You can look at them and play with them if the cluster is gone.
- Any number of users could point to the same endpoint from their notebooks. They might have custom installs of Jupyter or other kernels running for them.
- Resource contention between Jupyter's kernels and Spark is gone. There might be resource contention between the driver and the master too but that's a different topic, and @freeman-lab 's solution definitely works.
There are challenges with the remote spark submission scenario, like figuring out the right amount of data from the result set to bring back to the client via the wire (regardless of the protocol used). Is it a sample or the top of the result set? I believe we'll have to work through these challenges regardless of the implementation chosen for the remote submission scenario.
Now, the beauty of doing something like this is that, under the hood, this is all http requests via Python. The user can be typing Scala or R code that gets executed in the cluster, but locally, it's all Python code and there's no need to improve multiple kernels to support integration with different languages for Spark. It is the purpose of these magics to allow users to do automatic rich visualizations of their computations in Spark (think Zeppelin, which is written in Scala but supports different languages) by integrating with the library that @ellisonbg is writing a proposal for. Advanced users might want to interact with the raw pandas dataframes that the kernel is using to visualize the data, but that's up to them. Canonical Spark users would do their computations in the cluster (keeping data and compute close together) and only retrieve the end result of their computations.
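As a concrete sketch of that last step, here is one way the magics could turn a remote result into something pandas-friendly. The columnar result shape below is an assumption for illustration, not Livy's actual wire format:

```python
def to_records(result):
    """Convert a columnar payload like {'headers': [...], 'data': [[...], ...]}
    into a list of dicts -- the form pandas.DataFrame() consumes directly."""
    headers = result["headers"]
    return [dict(zip(headers, row)) for row in result["data"]]

# A made-up result for illustration:
sample = {"headers": ["word", "count"], "data": [["spark", 12], ["livy", 7]]}
records = to_records(sample)
# pandas.DataFrame(records) would then give a two-row frame with columns
# 'word' and 'count' (pandas is left out to keep the sketch stdlib-only).
```

Whatever code the user typed in the cell (Scala, R, or Python), the kernel-side handling stays in Python: parse the HTTP response, build the DataFrame, hand it to the visualization layer.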
Again, loving the conversation, and I'm very eager to hear your feedback!
@aggFTW Thanks for the clarification. It helps frame the context. I think the proposal might benefit from some of what you just said ultimately winding up in it.
One thing and then I'll stop for the night ...
This driver DOES need network access to nodes in the cluster.
The above is true but ...
So, as long as you have an executable that has network access to the master in the cluster, for the drivers it creates, and can ask the master to do work, you'll be able to use Spark.
This is necessary but not sufficient.
To be 100% clear, the workers in the Spark cluster need access to the driver as well. By this I mean the driver is also a server with a set of ports listening for connections from Spark executors which are also clients. Every worker in the Spark cluster must be able to establish a network connection back to the driver.
Sounds crazy, I know, and that's why I'm trying to emphasize it. It's just how Spark works.
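For readers wiring this up, the connect-back requirement usually surfaces as driver-side configuration. `spark.driver.host` and `spark.driver.port` are real Spark properties; the host names and port value below are placeholders, not recommendations:

```python
# Driver-side settings that make the "workers dial back to the driver"
# topology work when the driver sits outside the cluster network.
driver_conf = {
    "spark.master": "spark://master.example.com:7077",
    # Address the workers use to reach the driver -- must be routable
    # from every worker node, not just from the master.
    "spark.driver.host": "driver.example.com",
    # Pinning the port (instead of the random default) lets you open a
    # single firewall rule for executor -> driver connections.
    "spark.driver.port": "51000",
}
```

If any worker cannot open a connection to that host/port pair, jobs will hang or fail even though the driver can happily reach the master.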
Yes! I'm aware of that =) Thanks for clarifying it!
I'll work on an update for the proposal on Monday so that it's easier to understand.
Awesome discussion everyone! Super helpful, and will be great to see some of this in the proposal, especially this distinction from @aggFTW , which seems key
It is the purpose of these magics to allow users to do automatic rich visualizations of their computations in Spark ... Canonical Spark users would do their computations in the cluster (keeping data and compute close together) and only retrieve the end result of their computations.
We've often had no choice but to aggregate large results with Spark + Jupyter (esp true in machine learning applications), and I've worried about separation making that even worse. But the magics could definitely target use cases where smart summaries make it far less of an issue.
I guess what I'm missing is what this REST API is. As far as I know, there is no Spark REST API, although if there was one it would be awesome. How is this REST API going to consume lambda functions in the variety of languages that Spark and Jupyter support? How much of a task would it be to keep this API in line with Spark's? MapReduce, streaming, MLlib, graph, etc... @parente's proposal of a kernel provisioner and remote kernels is more in line with the user experience of the notebook user. Just connect to a kernel that supports Spark.
Good point! I had it in my head that the server-side implementation here is still Livy as per the google group discussion. But you're right, I don't see Livy mentioned in the proposal.
Livy feels like an odd choice. If you think about it, Livy is yet another competing REPL environment with yet another protocol. It would be analogous to having a Jupyter kernel talking to another "kernel". I'm writing code in a notebook cell that is sent to kernel 1 just to get packaged in a REST envelope and sent on to be executed by Livy's "kernel". I guess it works, but it's odd. If anything, Jupyter should learn from newcomers like Livy, Zeppelin, and others and think about adopting alternative options for kernel communication (web sockets, anyone?)
That's how the JS frontend communicates with kernels. Tornado is "just" a websocket/ZMQ bridge.
Thanks everyone for the comments. I didn't realize that the driver also… Even though in principle it is possible to run the driver elsewhere, in… I do think that the remote kernel APIs that folks are working on will free… The data locality aspect is important and talking to spark over Livy… There are some other pros and cons of the two approaches:

Because of these things and the others mentioned above, both modes (direct… Alejandro, it would be great to add links to Livy and the other things you…

Brian E. Granger
Couple more comments about various aspects of the proposal and discussion above.
I would add Livy as your leading candidate for the implementation to the proposal.
Right, but now there's a dependency on ensuring that Livy, or whatever REST endpoint is chosen, continues to be maintained external to the Jupyter community for these magics to work. I think it's best to make that dependency explicit in the proposal.
Some use cases might help here. I'm having a hard time envisioning the separation of canonical vs advanced Spark users. Is there really a separation in which all users are willing to write Scala, R, or Python to use Spark, but then are not willing to write more code to work with the results? Regardless, I think it's important to call out that even if the user is primarily a Scala or R coder, the local results are only natively available via Python pandas DataFrames. (I'm not against it: just looking to make the corners more explicit.)
This is true, until you want to pass a non-trivial lambda to Spark that uses an external library to, say, parse a custom file format (or even a trivial one). Then the ideal of not having to worry about what's installed on all the Spark workers shatters.
On the topic of pros and cons, I think it's worth calling out how the proposal (and any others on this topic) will support the various Spark APIs (e.g., DataFrames, GraphX, Streaming). If the results will be represented by pandas DataFrames in the local kernel, does this imply only code that returns Spark DataFrames will be supported, not RDDs or GraphX objects? Likewise, if the API is a pure REST API, does that imply Streaming will be out of scope? It's worth stating these in the proposal to level-set would-be users.
Great points!

Brian E. Granger
Side question: I'm no expert on Spark, but should we really focus on "magics"? Would it make sense to say that the magics would (just) be a test case for these libraries?
In this case, I think magics do make the most sense. The reason is that the… https://github.com/cloudera/hue/tree/master/apps/spark/java#pyspark-example

The options for dealing with an API like that are:…

In this case, I think there will be 1 and 2, but layer 1 will be super… But really the more complex API is actually PySpark, SparkR, Scala+Spark…

Brian E. Granger
Ok, fair enough. As I said, I don't have enough knowledge of Spark/Livy & co to judge.
I agree with both of you, @ellisonbg and @Carreau. The way I see it, the magics will only be one of the possible usages of a Python Livy client that will know how to interact with Livy once it receives a string. So, you could say that client is a library, and the magics use it :)
I'm still working on the proposal, but I'll post it tomorrow. I'm working on adding all the feedback you've given us. Thanks!
proposal addressing discussion so far
Updated proposal is posted now. I tried to address all the feedback so far. Thanks!
Scope:

* IPython magics to enable remote Spark code execution through [Livy](https://github.com/cloudera/hue/tree/master/apps/spark/java), a Spark REST endpoint, which allows for Scala, Python, and R support as of September 2015.
* The project will create a Python Livy client that will be used by the magics.
* The project will integrate the output of the Livy client with the rich visualization framework that is being proposed [LINK].
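A sketch of how the magics layer could sit on top of that Python Livy client follows. Every name here (the fake client class, the cell-magic helper) is illustrative only, standing in for the proposal's eventual API:

```python
class FakeLivyClient:
    """Stand-in for the proposed Python Livy client: instead of making
    HTTP calls, it records the code strings it would have submitted."""

    def __init__(self, url, language):
        self.url, self.language = url, language
        self.submitted = []

    def execute(self, code):
        self.submitted.append(code)
        return {"status": "ok", "language": self.language}

def spark_cell_magic(client, cell):
    """What a %%spark cell-magic body might do: ship the cell's source
    to the remote endpoint verbatim and hand back the result."""
    return client.execute(cell)

client = FakeLivyClient("http://livy-server:8998", "scala")
result = spark_cell_magic(client, "sc.parallelize(1 to 10).count()")
```

The separation matters: the client library owns the Livy protocol, while the magic is a thin wrapper that forwards the cell text, so other apps could reuse the client without Jupyter.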
I would clarify that pandas data frames will be used as an intermediate in this chain
That clarification is in the additional notes below. Do you feel it should also be here?
Probably
Re: Spark driver <=> workers: There are well-understood practices in distributed systems for having the driver communicate with workers directly. IMO, that traces back to precedents from Mesos, two-level cluster schedulers, and the need for supporting fault-tolerance and multi-tenancy together. Google's "Omega" paper gives much more detail, plus performance analysis at scale that shows this rationale clearly. Monolithic cluster schedulers tend to hit a knee quickly (beyond ~10K executors), so the Mesos approach was to remove as much state from masters as possible so that they can be restarted rapidly. Workers maintain state, and in addition can replay logs from failed workers. Drivers must communicate directly to coordinate the recovery from failures, speculative execution, etc., and any synchronization barriers needed in the workloads. Per the Omega analysis, alternatives involve transactions to capture distributed state, which become quite difficult and expensive. It may help to mention that Spark drivers often live within the cluster that they're using -- not necessarily, but it's a common practice.
moving pandas dataframes context
Paco, thanks for this clarification, this helps me to understand the…

Brian E. Granger
Definitely it'd be weird to optimize for huge edge cases. But no, this requirement is because of the distributed kernel, as a pre-req for leveraging containers (Mesos, Kubernetes, Omega, etc.). Even if your app only runs 2-3 workers, when it runs inside AWS, GCP, Azure, etc., then it's within the context of 10K+ server nodes, and the cluster manager has the large-scale problem. Virtualization hid that problem, but it introduced bottlenecks. In the case of Spark, we tend to see smaller clusters used per app than Hadoop (10-30x smaller, from what I see); however, the infrastructure itself may be quite large -- with a large community of apps running multi-tenant.
I don't have any more feedback at this point. I think the rest will be worked out in the details of the code. I am +1 on this sparkmagic proposal.
+1 on the proposal

👍

👍 as well...

+1 too.
We would like to declare consensus and accept this proposal. Congrats! We will create a repo here shortly and add everyone to it.
sparkmagic incubation proposal.
Link to repo in proposals.md to be added once repo is created.