Multi-server setup #109

Closed
jaredbischof opened this issue Dec 4, 2014 · 16 comments

Comments

@jaredbischof

Am I correct in assuming that there are long-term plans to support multiple jupyterhub servers running as a single instance? I don't believe that this is possible right now unless I am missing something (please correct me if I'm wrong). It would be nice to have this for load balancing. Cheers!

@minrk
Member

minrk commented Dec 4, 2014

@ssanderson is doing this, using postgres to mediate the Hub state. He would know best what's in the way of getting it to work.

@ssanderson
Contributor

@jaredbischof what do you mean by "as a single instance"? It's possible to share a database between multiple jupyterhubs, but you have to ensure that each user is always routed to the correct single-user server and hub. That means you either need an external service that sits in front of the built-in proxy and knows how to route users, or you need a way to share the built-in proxy across multiple jupyterhubs. Both of those things are probably doable, but neither is especially supported out of the box.

@jaredbischof
Author

Hi Scott, yeah I didn't know the best way to term what I meant but you're describing exactly what I'm talking about. Do you guys have a timeline for doing this? Just curious. Thanks for your response!

@ssanderson
Contributor

I don't think there are any near-term plans to support that functionality in the main distribution. I'm working on a larger system for work (Quantopian) that manages clusters of jupyterhub servers. That project isn't in a particularly open-sourceable state, since it's wrapped up pretty intimately with our existing infrastructure.

@yuvipanda
Contributor

I'm going to give it a shot over the next few weeks for jupyter.wmflabs.org, possibly using Docker Swarm. Will keep you guys posted on how it goes.

@minrk
Member

minrk commented Mar 8, 2015

@yuvipanda great, thanks! You might look at https://github.com/compmodels/jupyterhub-deploy, where @jhamrick is using swarm to distribute user containers. That's not what is described here, though, which is multiple Hubs.

@yuvipanda
Contributor

(Many moons later...)

So I've finally managed to set one up on https://tools.wmflabs.org/paws/hub/oauth_login, with a Kubernetes backend. However, the jupyterhub instance itself (plus the proxy) runs as a single copy, so it's a single point of failure (SPOF).

The two components in the single 'hub' pod right now are the proxy and jupyterhub itself. Am I right in assuming that if I can somehow synchronize state between all the proxies, and use a MySQL/Postgres backend for jupyterhub, I can scale both of them horizontally and independently of each other? Is there any useful state in the jupyterhub process itself that isn't stored in the db?

If this is correct, I can probably work on a way to scale the proxy out horizontally, which should work... There are multiple ways to do this, ranging from fanning out to all proxies via a wrapper to implementing a different proxy with a compatible interface that uses etcd or something similar to sync data (which people with more complex setups can use). But if jupyterhub itself stores state, we need to factor that out first...
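As a rough illustration of the fan-out wrapper idea, here is a minimal sketch. It is not code from this thread or from any existing proxy: `RouteTable`, `InMemoryProxy`, and `FanOutProxy` are hypothetical names, and the in-memory class stands in for a real proxy that would be driven over its REST API.

```python
from typing import Dict, List, Protocol


class RouteTable(Protocol):
    """Interface this sketch assumes each proxy exposes (hypothetical)."""

    def add_route(self, path: str, target: str) -> None: ...

    def delete_route(self, path: str) -> None: ...

    def get_routes(self) -> Dict[str, str]: ...


class InMemoryProxy:
    """Stand-in for one proxy instance; keeps its routing table in a dict."""

    def __init__(self) -> None:
        self.routes: Dict[str, str] = {}

    def add_route(self, path: str, target: str) -> None:
        self.routes[path] = target

    def delete_route(self, path: str) -> None:
        self.routes.pop(path, None)

    def get_routes(self) -> Dict[str, str]:
        return dict(self.routes)


class FanOutProxy:
    """Wrapper that applies every route change to all underlying proxies."""

    def __init__(self, proxies: List[RouteTable]) -> None:
        if not proxies:
            raise ValueError("need at least one proxy")
        self.proxies = proxies

    def add_route(self, path: str, target: str) -> None:
        for proxy in self.proxies:
            proxy.add_route(path, target)

    def delete_route(self, path: str) -> None:
        for proxy in self.proxies:
            proxy.delete_route(path)

    def get_routes(self) -> Dict[str, str]:
        # All proxies are expected to hold the same table; read from the first.
        return self.proxies[0].get_routes()
```

This sketch ignores partial failures: if one proxy is unreachable during an update, the tables diverge, so a real wrapper would need retries or a periodic reconciliation pass.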

@yuvipanda
Contributor

This also helps solve my other problem, which is availability. I like having at least two of everything so I can drain one of traffic and do stuff to it...

So as questions:

  1. What state (if any?) is kept in the jupyterhub process itself?
  2. If the answer to (1) is 'None', will just providing a scalable proxy be enough?
  3. If the answer to (1) is not 'None' - what is the state that's kept in there?

I'm super interested in pushing this forward :)

@minrk
Member

minrk commented Dec 1, 2015

The Hub process can be killed and resumed while leaving all other processes up, so there isn't any long-term state that resides in the Hub. All state is meant to reside in the database. #185 is probably the best illustration of state that resides in the process—mainly transients, such as spawn_pending, etc. I doubt it would behave properly if you made two simultaneous spawn requests of the same user on different Hubs using the same database. However, you should be able to do failover - start a second Hub and migrate URL handling before taking down the first Hub.
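To make the concern about simultaneous spawns concrete, here is a toy model, assuming a guard flag that lives only in each Hub process. It is not JupyterHub's actual code: `SpawnGuardHub`, `begin_spawn`, and `finish_spawn` are hypothetical names, and a plain dict stands in for the shared database.

```python
from typing import Dict, List, Set


class SpawnGuardHub:
    """Toy Hub with a per-process pending-spawn guard (illustrative only)."""

    def __init__(self, shared_db: Dict[str, List[str]]) -> None:
        # Persisted state, shared by every Hub that uses this database.
        self.shared_db = shared_db
        # Transient state, visible only inside this process.
        self.spawn_pending: Set[str] = set()

    def begin_spawn(self, user: str) -> bool:
        """Return True if this Hub starts a spawn, False if one is pending here."""
        if user in self.spawn_pending:
            return False
        self.spawn_pending.add(user)
        return True

    def finish_spawn(self, user: str, server_url: str) -> None:
        """Record the started server in the shared database and clear the flag."""
        self.shared_db.setdefault(user, []).append(server_url)
        self.spawn_pending.discard(user)


db: Dict[str, List[str]] = {}
hub_a, hub_b = SpawnGuardHub(db), SpawnGuardHub(db)

first = hub_a.begin_spawn("alice")   # hub_a starts spawning
repeat = hub_a.begin_spawn("alice")  # hub_a rejects the duplicate request
other = hub_b.begin_spawn("alice")   # hub_b cannot see hub_a's flag
```

In this model a single Hub rejects the duplicate request, but the second Hub accepts it, so both could end up starting a server for the same user.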

@yuvipanda
Contributor

Ok, so I'll try and get the proxy to be distributable by putting some work into it this week and see how it goes!

@yuvipanda
Contributor

Not quite the same, but somewhat related: I now have an nginx-based Configurable HTTP proxy that jupyterhub can easily use (https://github.com/yuvipanda/jupyterhub-nginx-chp). It implements most of the swagger spec, and all the jupyterhub functionality I tested works fine. When I hit the limits of that, we can probably write another one that scales better across multiple machines.

@willingc
Contributor

willingc commented Jun 7, 2016

Good information by @yuvipanda, @ssanderson, and others related to this issue. As the issue is more than a year old and I'm not seeing a specific next action, I'm going to close this and mark it as "reference" so it will be discoverable and possibly included in future documentation. Thanks!

@kishorchintal

I am trying to set up multiple Hub instances behind an ELB in AWS. Does JupyterHub support this kind of configuration yet? I have an ELB with two JupyterHub instances attached to it, and I've enabled SSL (Secured TCP) listeners so that it can connect to Python2/3 kernels. When I access it via the ELB, it presents me with a page from one of these servers. But when I click on 'Control Panel', 'Home', or 'Create new notebook', it routes me to the other server and presents me with the login page again. Any directions to solve this will be much appreciated. Thanks

@minrk
Member

minrk commented Oct 3, 2016

To run JupyterHub with multiple instances behind a load-balancer, you would have to ensure that the load balancer sends requests for the same user to the same Hub instance every time.
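One way to picture this kind of user affinity is a deterministic mapping from username to Hub. The sketch below is only an illustration under assumptions: `pick_hub` and the backend addresses are hypothetical, and it presumes the routing layer already knows the username (for example from a `/user/<name>/` URL prefix). Requests made before login carry no username, so in practice cookie-based sticky sessions on the load balancer may be the more realistic option.

```python
import hashlib
from typing import Sequence


def pick_hub(username: str, hubs: Sequence[str]) -> str:
    """Map a username to one Hub backend deterministically."""
    if not hubs:
        raise ValueError("need at least one Hub backend")
    # Use a stable hash; Python's built-in hash() varies between processes.
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hubs)
    return hubs[index]


# Hypothetical backend addresses, for illustration only.
hubs = ["hub-a.internal:8000", "hub-b.internal:8000"]
chosen = pick_hub("alice", hubs)
```

With plain modulo hashing, adding or removing a Hub remaps many users at once; consistent or rendezvous hashing would reduce that churn.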

@jsill14

jsill14 commented Mar 16, 2017

@minrk Do you need the load balancer to send requests for the same user to the same Hub instance so that their data is there? If the data were stored on or replicated across the Hubs, could you spawn the users on any Hub?

@minrk
Member

minrk commented Mar 17, 2017

Since a user's server is persistent, you have to make sure that the same Hub gets every request for a given user, at least while that user's server is running, in order to route requests properly. The routing information is in-memory state for the Hub, so it isn't shared across instances. The Hub can reconstruct this state from information persisted to the database, and does so at startup, but it doesn't reconstruct the state on every request, which would be needed for a single user to be handled correctly across multiple Hubs.
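A toy model may help show why loading state only at startup matters. This is a sketch under assumptions, not JupyterHub's implementation: `ToyHub`, `spawn`, and `lookup` are hypothetical names, and a plain dict stands in for the shared database.

```python
from typing import Dict, Optional


class ToyHub:
    """Toy Hub that caches running servers in memory (illustrative only)."""

    def __init__(self, name: str, database: Dict[str, str]) -> None:
        self.name = name
        self.database = database
        # In-memory state is rebuilt from the database once, at startup.
        self.running: Dict[str, str] = dict(database)

    def spawn(self, user: str, url: str) -> None:
        """Start a server: update local memory and persist to the database."""
        self.running[user] = url
        self.database[user] = url

    def lookup(self, user: str) -> Optional[str]:
        """Answer from memory only; the database is not re-read per request."""
        return self.running.get(user)


db: Dict[str, str] = {}
hub_a = ToyHub("a", db)
hub_b = ToyHub("b", db)
hub_a.spawn("alice", "http://10.0.0.5:8888")

seen_by_a = hub_a.lookup("alice")                     # the spawning Hub knows
seen_by_b = hub_b.lookup("alice")                     # the other Hub is stale
seen_after_restart = ToyHub("c", db).lookup("alice")  # a fresh start reloads
```

In this model the Hub that spawned the server can route to it, a second Hub started earlier cannot, and a Hub started afterwards can, which mirrors the failover-but-not-load-balancing behaviour described in this thread.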


7 participants