
sparkmagic proposal #2

Merged
merged 4 commits on Sep 21, 2015

Conversation

@aggFTW
Contributor

commented Sep 11, 2015

sparkmagic incubation proposal.

Link to repo in proposals.md to be added once repo is created.

sparkmagic proposal
@rgbkrk

Member

commented Sep 12, 2015

Super exciting!


Scope:
* Spark will be the first back end provided by these magics, but it could easily be extended to other big data back ends.
* Remote Spark execution through a REST endpoint which allows for Scala, Python, and R support as of Sept 2015.

@rgbkrk

rgbkrk Sep 12, 2015

Member

What's the deadline for?

@aggFTW

aggFTW Sep 12, 2015

Author Contributor

It's not a deadline. It's just stating the set of languages currently supported by Spark.

@rgbkrk

rgbkrk Sep 12, 2015

Member

Ah, now I gotcha


Alternatives we know of:

* Combination of IPython, R kernel (or rpy2), and Scala kernel for an in-Spark-cluster Jupyter installation. This does not allow the user to point to different Spark clusters. It might also result in resouce contention (CPU or memory) between the Jupyter installation and Spark.

@rgbkrk

rgbkrk Sep 12, 2015

Member

s/resouce/resource/

@aggFTW

aggFTW Sep 12, 2015

Author Contributor

Thanks!

typo resource
Fixing typo.
Alternatives we know of:

* Combination of IPython, R kernel (or rpy2), and Scala kernel for an in-Spark-cluster Jupyter installation. This does not allow the user to point to different Spark clusters. It might also result in resource contention (CPU or memory) between the Jupyter installation and Spark.
* IBM's Spark kernel does not provide a REST endpoint, requires the installation of Jupyter in the cluster, and does not create pandas dataframes.

@rgbkrk

rgbkrk Sep 12, 2015

Member

These sound like all negatives! What are lessons to be learned from what the spark kernel currently has? Would that team want to get involved on this proposal?

@damianavila

damianavila Sep 12, 2015

Member

Also it would be nice to have links to all these projects...

@ellisonbg

ellisonbg Sep 12, 2015

Contributor

As I have learned more about the way that Spark is typically used, it makes less and less sense to me. Having to run pyspark, sparkr, etc. directly on the Spark cluster breaks so many of the good lessons of the modern web and distributed architectures. I think talking to Spark over REST makes tons of sense and is a huge step forward. It nicely separates the usage of Spark (just HTTP requests!) from its installation and deployment. It also enables a much more flexible set of ways of integrating with Jupyter, as kernels no longer have to be run directly on the Spark nodes. I am hoping to get feedback from the folks who built the Spark kernel, though, to see how this vision plays with what they are thinking. @vinomaster @parente

@parente

parente Sep 12, 2015

Member

IBM's Spark kernel does not provide a REST endpoint, requires the installation of Jupyter in the cluster, and does not create pandas dataframes.

Personally, I've always treated the "IBM Spark Kernel" as a Scala kernel for Jupyter that has all the pieces needed to talk to Spark from Scala, and gives you SparkContext by default when you run it. No more, no less. It's not really all that different from the other Jupyter language kernels IMHO.

I think talking to spark over REST makes tons of sense and is a huge step forward.

It is a step forward from having to run the entire Jupyter notebook server plus kernels on the same L3 network as the Spark workers. But keep in mind a pure REST API will likely not work well with Spark Streaming, which is one of the huge draws of Spark.

It also enables a much more flexible set of ways for integrating with Jupyter as kernels no longer have to be run directly on the spark nodes.

Again, while a REST API is a nice simple first step toward remote access to a Spark cluster, there are potential benefits to keeping the kernels on / near the compute cluster but running the Notebook web server app remotely, namely:

  • Compute driver (kernel) stays close to compute workers (Spark) stays close to (big) data
  • You get streaming "for free" thanks to the Jupyter protocol and implementation (0mq + websockets)
  • Enabling kernels to run remote from the Notebook server has other benefits beyond remote access to Spark

The last is the impetus for the potential Kernel Provisioner and Gateway API proposal. I've spent most of my time so far documenting potential use cases for it, the third of which on that page could cover remote access to a Spark (or any) compute cluster. I'm still figuring out how to bring it forward (JEP? incubator?) and consulting with @rgbkrk, @freeman-lab, et al. about how it fits with other efforts under way. That said, I think it will take some time to realize, so please don't see it as discouraging this magics + REST API proposal. I don't believe there needs to be one and only one way to get Jupyter to work with Spark.

/cc @lbustelo

@freeman-lab

freeman-lab Sep 13, 2015

@ellisonbg @parente One thing maybe to add here... for the vanilla approach, item number 1, it's typical for the "driver", which constructs the execution graph and submits it to the Spark master, to be on the same machine as the "master", which is part of the cluster -- but this isn't necessary.

The driver can be on a different machine so long as it's addressable from the workers (see here)

In our own deployments, we used to run driver and master on the same machine, but due to resource conflict concerns, we started running the driver on a different machine. This solves the resource issue, and otherwise works the same -- just start up a Jupyter notebook and create a SparkContext. And we regularly use this to talk to multiple clusters.

I generally agree with @ellisonbg that there's a lot to like about the RESTful model, but just wanted to add this to the discussion!

@lbustelo

lbustelo Sep 13, 2015

This is the last thing I would say tonight cause I need to pay attention to the football game... Not sure it is fair to put the pandas data frame statement on the Spark kernel. First, it is a Scala kernel, so of course it does not have a pandas data frame, but the same can be said about all the other Jupyter kernels that are not Python based. Let's remember that this is not IPython, it is Jupyter.

@aggFTW

aggFTW Sep 13, 2015

Author Contributor

Awesome discussion! Thanks so much for your contributions.

Let me first try to address how Spark works to get to a shared understanding. If I've misunderstood something, please let me know.

The spark driver is the program that holds the spark context and creates the metadata around the RDDs, which are then evaluated by worker nodes. This driver DOES need network access to nodes in the cluster.

So, as long as you have an executable that has network access to the master in the cluster, for the drivers it creates, and can ask the master to do work, you'll be able to use Spark. This executable could be spark-submit, the pyspark shell, Livy, an IPython kernel, a Scala kernel, IBM's Spark kernel...

The architecture we are proposing for these magics is one where the magics talk to a REST endpoint (we are thinking Livy) that can create different drivers for the user and fetch results for them. You send a string of [Python, Scala, R] code that Livy relays to the Spark driver it created, and you start getting the results back from Livy over HTTP. You could certainly choose to have a remote kernel that creates your drivers for you and use 0mq, but it's our belief that using a REST endpoint would extend better to apps other than Jupyter kernels.
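
A minimal sketch of that round trip, assuming a Livy-style HTTP API (the paths and JSON field names below mirror Livy's session/statement endpoints as we understand them, but treat them as illustrative, not the final API):

```python
import json

# Hypothetical endpoint URL; in practice this would be configurable per cluster.
LIVY_URL = "http://livy.example.com:8998"

def session_payload(kind):
    """Body for POST /sessions: asks the endpoint to start a remote Spark
    driver of the given kind ('pyspark', 'spark', or 'sparkr')."""
    return {"kind": kind}

def statement_payload(code):
    """Body for POST /sessions/{id}/statements: the string of user code the
    endpoint relays to the remote Spark driver."""
    return {"code": code}

payload = json.dumps(statement_payload("sc.parallelize(range(100)).count()"))

# With the `requests` package and a live server, the flow would look like:
#   requests.post(LIVY_URL + "/sessions", data=json.dumps(session_payload("pyspark")))
#   requests.post(LIVY_URL + "/sessions/0/statements", data=payload)
#   requests.get(LIVY_URL + "/sessions/0/statements/0")  # poll for the result
```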

The benefits of having a remote installation of Jupyter that is able to connect to different clusters by virtue of changing the URL endpoint are manifold:

  1. Your notebooks are valuable by themselves, without the need to have a cluster running. You can look at them and play with them if the cluster is gone.
  2. Any number of users could point to the same endpoint from their notebooks. They might have custom installs of Jupyter or other kernels running for them.
  3. Resource contention between Jupyter's kernels and Spark is gone. There might be resource contention between the driver and the master too but that's a different topic, and @freeman-lab 's solution definitely works.

There are challenges with the remote spark submission scenario, like figuring out the right amount of data from the result set to bring back to the client via the wire (regardless of the protocol used). Is it a sample or the top of the result set? I believe we'll have to work through these challenges regardless of the implementation chosen for the remote submission scenario.
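
One way to frame that result-set challenge: whatever protocol is used, the client needs a policy for capping what comes over the wire. A trivial sketch (the function name and default are invented for illustration):

```python
def take_head(records, max_rows=1000):
    """Return at most max_rows records, plus a flag indicating whether the
    result set was truncated, so the client can tell the user."""
    truncated = len(records) > max_rows
    return records[:max_rows], truncated
```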

Now, the beauty of doing something like this is that, under the hood, this is all http requests via Python. The user can be typing Scala or R code that gets executed in the cluster, but locally, it's all Python code and there's no need to improve multiple kernels to support integration with different languages for Spark. It is the purpose of these magics to allow users to do automatic rich visualizations of their computations in Spark (think Zeppelin, which is written in Scala but supports different languages) by integrating with the library that @ellisonbg is writing a proposal for. Advanced users might want to interact with the raw pandas dataframes that the kernel is using to visualize the data, but that's up to them. Canonical Spark users would do their computations in the cluster (keeping data and compute close together) and only retrieve the end result of their computations.

Again, loving the conversation, and I'm very eager to hear your feedback!

@parente

parente Sep 13, 2015

Member

@aggFTW Thanks for the clarification. It helps frame the context. I think the proposal might benefit from some of what you just said ultimately winding up in it.

One thing and then I'll stop for the night ...

This driver DOES need network access to nodes in the cluster.

The above is true but ...

So, as long as you have an executable that has network access to the master in the cluster, for the drivers it creates, and can ask the master to do work, you'll be able to use Spark.

This is necessary but not sufficient.

To be 100% clear, the workers in the Spark cluster need access to the driver as well. By this I mean the driver is also a server with a set of ports listening for connections from Spark executors which are also clients. Every worker in the Spark cluster must be able to establish a network connection back to the driver.

Sounds crazy, I know, and that's why I'm trying to emphasize it. It's just how Spark works.
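
The constraint @parente describes is also why Spark exposes settings to pin the driver's listening address and ports, so they can be opened in a firewall. A hedged `spark-defaults.conf` sketch; the host and port values are illustrative:

```
# Address the workers use to connect back to the driver
spark.driver.host        driver.example.com
# Pin the driver's listening ports instead of letting Spark pick random ones
spark.driver.port        51000
spark.blockManager.port  51001
```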

@aggFTW

aggFTW Sep 13, 2015

Author Contributor

Yes! I'm aware of that =) Thanks for clarifying it!

I'll work on an update for the proposal on Monday so that it's easier to understand.

@freeman-lab

freeman-lab Sep 13, 2015

Awesome discussion everyone! Super helpful, and will be great to see some of this in the proposal, especially this distinction from @aggFTW , which seems key

It is the purpose of these magics to allow users to do automatic rich visualizations of their computations in Spark ... Canonical Spark users would do their computations in the cluster (keeping data and compute close together) and only retrieve the end result of their computations.

We've often had no choice but to aggregate large results with Spark + Jupyter (esp true in machine learning applications), and I've worried about separation making that even worse. But the magics could definitely target use cases where smart summaries make it far less of an issue.

## Integration with Project Jupyter

These magics will be pip-installable and loadable from any Jupyter/IPython installation.
By virtue of returning pandas dataframes, the dataframes will be easily visualizable by using the library created by the automatic rich visualizations incubation subproject.
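
By way of illustration, that hand-off could be as small as parsing the endpoint's JSON rows into a DataFrame; a sketch assuming the REST endpoint returns results as a JSON array of records (the field names are invented):

```python
import json

import pandas as pd

def records_to_dataframe(payload):
    """Parse a JSON array of row records, as a REST endpoint might return
    them, into a pandas DataFrame for local visualization."""
    return pd.DataFrame(json.loads(payload))

df = records_to_dataframe('[{"word": "spark", "count": 3}, {"word": "livy", "count": 2}]')
```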

@damianavila

damianavila Sep 12, 2015

Member

I guess this is related with the email Bryan sent to the list some days ago... it would be nice to link that thread until there is a visualization subproject page/proposal/pr.
If it is another project, it should be referenced.

In general terms, you need to make references when you are talking about others projects 😉

@ellisonbg

ellisonbg Sep 12, 2015

Contributor

I am in the process of authoring a separate incubation proposal for the vis side of this.

@damianavila

damianavila Sep 12, 2015

Member

Great, thanks @ellisonbg.

@aggFTW

aggFTW Sep 13, 2015

Author Contributor

Thanks @damianavila! I'll add the links.

* Allow any Jupyter installation to point to different Spark installations.

Scope:
* Spark will be the first back end provided by these magics, but it could easily be extended to other big data back ends.

@ellisonbg

ellisonbg Sep 12, 2015

Contributor

Do you have examples of other backends this might talk to? How serious are you about other backends? My initial thought is to keep the scope of this small and limit it to Spark, as it is a fairly big ecosystem itself.

@aggFTW

aggFTW Sep 13, 2015

Author Contributor

You are right, I'll remove that line.

@lbustelo


commented Sep 13, 2015

I guess what I'm missing is: what is this REST API? As far as I know, there is no Spark REST API, although if there was one it would be awesome. How is this REST API going to consume lambda functions in the variety of languages that Spark and Jupyter support? How much of a task would it be to keep this API in line with Spark's? MapReduce, streaming, MLlib, graph, etc...

@parente's proposal of kernel provisioners and remote kernels is more in line with the user experience of the notebook user. Just connect to a kernel that supports Spark.

@parente

Member

commented Sep 13, 2015

I guess what I'm missing is: what is this REST API?

Good point! I had it in my head that the server-side implementation here is still Livy, as per the Google Group discussion. But you're right, I don't see Livy mentioned in the proposal.

@lbustelo


commented Sep 13, 2015

Livy feels like an odd choice. If you think about it, Livy is yet another competing REPL environment with yet another protocol. It would be analogous to having a Jupyter kernel talking to another "kernel". I'm writing code in a notebook cell that is sent to kernel 1 just to get packaged in a REST envelope and then sent to and executed by Livy's "kernel". I guess it works, but odd.

If anything, Jupyter should learn from newcomers like Livy, Zeppelin, and others and think about adopting alternative options for kernel communication (websockets, anyone?)

@Carreau

Member

commented Sep 13, 2015

(web sockets anyone?)

That's how the JS frontend communicates with kernels. Tornado is "just" a websocket/ZMQ bridge.
The ZMQ library is more widespread than websockets, and has more communication patterns.

@ellisonbg

Contributor

commented Sep 13, 2015

Thanks everyone for the comments. I didn't realize that the driver also listened and had to be available for workers to connect to. Is there some fundamental reason Spark is designed that way? It appears to just be bad distributed-system design.

Even though in principle it is possible to run the driver elsewhere, in
practice that doesn't seem very realistic - minimally it has to be on the
same set of hosts with a clear security/firewall context.

I do think that the remote kernel APIs that folks are working on will free
us up for users to connect to kernels running on remote systems (remote
from the notebook server and documents). That is a very important step
forward.

The data locality aspect is important, and talking to Spark over Livy definitely limits the amount of data a user would be able to work with locally.

There are some other pros and cons of the two approaches:

  • Running Jupyter kernels local to the drivers forces a user to have all of
    their code and packages on that system. It isn't just about installing
    spark+jupyter+kernels on that system, but also every bit of code you might
    want to run. At that point, you have to effectively treat that system as
    your system.
  • Talking to Spark using Livy allows you to only worry about installing and
    running spark+livy on that system - everything else, all of your libraries,
    etc., can be on other systems where you are doing the rest of your work.
  • Using Livy allows someone to use PySpark/SparkR/Scala+Spark without any
    local install of all those environments.

Because of these things and the others mentioned above, both modes (direct
pyspark/sparkR and magics+livy) are really important to support and have
working well.

Alejandro, it would be great to add links to Livy and the other things you
mention in the proposal.


Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgranger@calpoly.edu and ellisonbg@gmail.com

@parente

Member

commented Sep 14, 2015

Couple more comments about various aspects of the proposal and discussion above.

The architecture we are proposing for these magics is one such that the magics would talk to a REST endpoint (we are thinking Livy)

I would add Livy as your leading candidate for the implementation to the proposal.

... and there's no need to improve multiple kernels to support integration with different languages for Spark

Right, but now there's a dependency on ensuring that Livy, or whatever REST endpoint is chosen, continues to be maintained external to the Jupyter community for these magics to work. I think it's best to make that dependency explicit in the proposal.

Advanced users might want to interact with the raw pandas dataframes that the kernel is using to visualize the data, but that's up to them. Canonical Spark users would do their computations in the cluster (keeping data and compute close together) and only retrieve the end result of their computations.

Some use cases might help here. I'm having a hard time envisioning the separation of canonical vs advanced Spark users. Is there really a separation in which all users are willing to write Scala, R, or Python to use Spark, but then are not willing to write more code to work with the results?

Regardless, I think it's important to call out that even if the user is primarily a Scala or R coder, the local results are only natively available via Python pandas DataFrames. (I'm not against it: just looking to make the corners more explicit.)

Talking to spark using Livy allows you to only worry about installing, running spark+livy on that system - everything else, all of you libraries, etc .can be on other systems where you are doing the rest of your work.

This is true, until you want to pass a non-trivial lambda to Spark that uses an external library to, say, parse a custom file format (or even a trivial one: https://github.com/databricks/spark-csv). Then the ideal of not having to worry about what's installed on all the Spark workers shatters.

There are some other pros and cons of the two approaches:

On the topic of pros and cons, I think it's worth calling out how the proposal (and any others on this topic) will support the various Spark APIs (e.g., DataFrames, GraphX, Streaming). If the results will be represented by pandas DataFrames in the local kernel, does this imply only code that returns Spark DataFrames will be supported, not RDDs or GraphX objects? Likewise, if the API is a pure REST API, does that imply Streaming will be out-of-scope? It's worth stating these in the proposal to level-set would-be users.

@ellisonbg

Contributor

commented Sep 14, 2015

Great points!


@Carreau

Member

commented Sep 14, 2015

Side question: I'm no expert on Spark, but should we really focus on "magics"?
Magics have always been only thin wrappers around library calls.
I think we shouldn't lose focus on the python/scala/whatever script usage there is, and I would much rather put effort into a potentially slightly more complex library that is reusable than into "magics".

Would it make sense to say that the magics would (just) be a test case for these libraries?

@ellisonbg

Contributor

commented Sep 14, 2015

In this case, I think magics do make the most sense. The reason is that the
REST API for spark (Livy) expects you to pass it blocks of PySpark, SparkR
or Scala code as strings:

https://github.com/cloudera/hue/tree/master/apps/spark/java#pyspark-example

The options for dealing with an API like that are:

  1. Objects and functions which take those strings.
  2. Wrapping those objects and functions in magics to make it nice.
  3. Custom wrapper kernels (I don't think this makes sense as it would span
    all 3 languages).

In this case, I think there will be 1 and 2, but layer 1 will be super
simple (basically passing those strings to HTTP requests and getting back
the results). Thus the focus here on the magics.

But really the more complex API is actually PySpark, SparkR, Scala+Spark
itself and we do have separate work on those going on (as PRs directly on
the upstream repos).
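
The layering described above can be sketched like so; the names are illustrative and the HTTP transport is stubbed with a plain callable rather than real requests to Livy:

```python
class LivyClient:
    """Layer 1: a thin client that forwards strings of Spark code."""

    def __init__(self, transport):
        # transport: a callable taking a code string and returning the result;
        # in reality this would wrap HTTP requests to the REST endpoint.
        self._transport = transport

    def run(self, code):
        return self._transport(code)

def spark_magic(cell, client):
    """Layer 2: what a %%spark cell magic boils down to - hand the raw
    cell text to the client and return the result."""
    return client.run(cell)

# Stand-in for the real HTTP transport:
client = LivyClient(lambda code: "result of: " + code)
```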


@Carreau

Member

commented Sep 14, 2015

Ok, fair enough. As I said, I don't have enough knowledge of Spark/Livy & co to judge.

@aggFTW

Contributor Author

commented Sep 14, 2015

I agree with both of you, @ellisonbg and @Carreau. The way I see it, the magics will only be one of the possible usages of a python Livy client that will know how to interact with Livy once it receives a string. So, you could say that client is a library, and the magics use it :)

@aggFTW

Contributor Author

commented Sep 15, 2015

I'm still working on the proposal, but I'll post it tomorrow. I'm working on adding all the feedback you've given us.

Thanks!

updated proposal
proposal addressing discussion so far
@aggFTW

Contributor Author

commented Sep 15, 2015

Updated proposal is posted now. I tried to address all the feedback so far. Thanks!

Scope:
* IPython magics to enable remote Spark code execution through [Livy](https://github.com/cloudera/hue/tree/master/apps/spark/java), a Spark REST endpoint, which allows for Scala, Python, and R support as of September 2015.
* The project will create a Python Livy client that will be used by the magics.
* The project will integrate the output of Livy client with the rich visualization framework that is being proposed [LINK].

@ellisonbg

ellisonbg Sep 16, 2015

Contributor

I would clarify that pandas data frames will be used as an intermediate in this chain

@aggFTW

aggFTW Sep 16, 2015

Author Contributor

That clarification is in the additional notes below. Do you feel it should also be here?

@ellisonbg

ellisonbg Sep 16, 2015

Contributor

Probably

@ceteri


commented Sep 16, 2015

Re: Spark driver <=> workers:

There are well-understood practices in distributed systems for having the driver communicate with workers directly. IMO, that traces back to precedents from Mesos, two-level cluster schedulers, and the need for supporting fault-tolerance and multi-tenancy together. Google's "Omega" paper gives much more detail, plus performance analysis at scale that shows this rationale clearly. Monolithic cluster schedulers tend to hit a knee quickly (beyond ~10K executors), so the Mesos approach was to remove as much state from masters as possible so that they can be restarted rapidly. Workers maintain state, and in addition can replay logs from failed workers. Drivers must communicate directly to coordinate the recovery from failures, speculative execution, etc., and any synchronization barriers needed in the workloads. Per the Omega analysis, alternatives involve transactions to capture distributed state, which become quite difficult and expensive. It may help to mention that Spark drivers often live within the cluster that they're using -- not necessarily, but it's a common practice.

moving pandas dataframes context
@ellisonbg

Contributor

commented Sep 17, 2015

Paco, thanks for this clarification, this helps me to understand the
rationale. A few comments:

  • Even if you need to have a driver that is co-located with the workers
    from a network perspective, that doesn't mean that driver has to be the
    sole human contact point. This is especially true when ultimately all those
    drivers do is exec strings of Python/R code. That is part of what I like
    about the Livy idea - it clarifies the dual role the drivers actually play
    and separates out one of those roles.
  • If Spark is like many other distributed systems, most people who use
    it don't run with ~10k workers. I would guess (a complete guess!) that most
    Spark clusters are well below the ~10k range. While I understand that
    some people need massive scalability, I have always thought it was silly to
    make most users of a system suffer from usability problems just because
    of the needs of a few.


@ceteri


commented Sep 18, 2015

Definitely it'd be weird to optimize for huge edge cases. But no, this requirement is because of the distributed kernel, as a prerequisite for leveraging containers (Mesos, Kubernetes, Omega, etc.). Even if your app only runs 2-3 workers, when it runs inside AWS, GCP, Azure, etc., then it's within the context of 10K+ server nodes, and the cluster manager has the large-scale problem. Virtualization hid that problem, but it introduced bottlenecks. In the case of Spark, we tend to see smaller clusters used per app than Hadoop (10-30x smaller, from what I see); however, the infrastructure itself may be quite large -- with a large community of apps running multi-tenant.

@ellisonbg

Contributor

commented Sep 18, 2015

I don't have any more feedback at this point. I think the rest will be worked out in the details of the code.

I am +1 on this sparkmagic proposal.

@minrk


commented Sep 18, 2015

+1 on the proposal

@rgbkrk

Member

commented Sep 18, 2015

👍

@damianavila

Member

commented Sep 19, 2015

👍 as well...

@fperez


commented Sep 20, 2015

+1 too.

@ellisonbg

Contributor

commented Sep 21, 2015

We would like to declare consensus and accept this proposal. Congrats! We will create a repo here shortly and add everyone to it.

ellisonbg added a commit that referenced this pull request Sep 21, 2015

@ellisonbg ellisonbg merged commit b3a5b36 into jupyter-incubator:master Sep 21, 2015
