Process starvation in concurrent kernel starts, and scaling out JEG for scalability. #732
Comments
Thanks Evan. I'm cross-referencing the Discourse link here: https://discourse.jupyter.org/t/scalable-enterprise-gateway/2014
I'm working on the Submitter idea, using Celery as a PoC. This means the PoC introduces additional microservice dependencies, i.e. Redis and Celery. (I prefer Redis to RabbitMQ as the Celery message broker because it can also be used as a persistent session store 😁)
My idea is to replace
It means I'm working on implementing
It may be far off from now or totally different from this issue, but I'm guessing we can make "Kernel start" lazy by responding job id to
@kevin-bates Could you please leave a comment about lazy kernel start?
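To make the "lazy kernel start" idea concrete, here is a minimal sketch, assuming a standard-library thread pool in place of Celery and Redis. `LazyKernelStarter`, `submit`, `status`, and `fake_launch` are hypothetical names for illustration only; none of them are part of the Enterprise Gateway API. The request handler returns a job id immediately, and the caller polls for the kernel id later.

```python
import threading
import time
import uuid
from concurrent.futures import ThreadPoolExecutor


class LazyKernelStarter:
    """Hypothetical helper: accept start requests now, launch in the background."""

    def __init__(self, launch_fn, max_workers=4):
        self._launch_fn = launch_fn  # callable(kernel_name) -> kernel_id
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}  # job_id -> Future
        self._lock = threading.Lock()

    def submit(self, kernel_name):
        """Schedule a launch and return a job id without waiting for it."""
        job_id = uuid.uuid4().hex
        future = self._pool.submit(self._launch_fn, kernel_name)
        with self._lock:
            self._jobs[job_id] = future
        return job_id

    def status(self, job_id):
        """Report whether the launch is pending, ready, or failed."""
        with self._lock:
            future = self._jobs.get(job_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "ready", "kernel_id": future.result()}


def fake_launch(kernel_name):
    """Stand-in for a slow kernel launch (assumption, not real EG code)."""
    time.sleep(0.1)
    return kernel_name + "-kernel"


if __name__ == "__main__":
    starter = LazyKernelStarter(fake_launch)
    job = starter.submit("spark_python")
    print(starter.status(job))  # most likely still pending at this point
    time.sleep(0.5)
    print(starter.status(job))
```

In a Celery-based PoC the thread pool would be replaced by a task queue and the `_jobs` dictionary by the result backend, but the request/poll shape stays the same.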
@esevan - thanks for looking into this. Got a couple comments.
@kevin-bates Thank you for the detail!
As for the persistence layer, I would also recommend defining an abstraction layer so one can choose between MongoDB or any other data store they use or need.
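One way such an abstraction layer could look is sketched below. This is only an illustration under assumed names: `KernelSessionStore`, `InMemorySessionStore`, and `JsonFileSessionStore` are hypothetical and do not correspond to existing Enterprise Gateway classes. A MongoDB- or Redis-backed store would implement the same three methods.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class KernelSessionStore(ABC):
    """Hypothetical interface that any persistence backend would implement."""

    @abstractmethod
    def save(self, kernel_id: str, info: dict) -> None:
        """Persist the session info for a kernel."""

    @abstractmethod
    def load(self, kernel_id: str) -> Optional[dict]:
        """Return the stored info, or None if the kernel is unknown."""

    @abstractmethod
    def delete(self, kernel_id: str) -> None:
        """Remove the stored info, if any."""


class InMemorySessionStore(KernelSessionStore):
    """Non-persistent backend, useful for tests or a single instance."""

    def __init__(self):
        self._sessions = {}

    def save(self, kernel_id, info):
        self._sessions[kernel_id] = dict(info)

    def load(self, kernel_id):
        return self._sessions.get(kernel_id)

    def delete(self, kernel_id):
        self._sessions.pop(kernel_id, None)


class JsonFileSessionStore(KernelSessionStore):
    """File backend: one JSON file per kernel in a (possibly shared) directory."""

    def __init__(self, directory):
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, kernel_id):
        return self._dir / (kernel_id + ".json")

    def save(self, kernel_id, info):
        self._path(kernel_id).write_text(json.dumps(info))

    def load(self, kernel_id):
        path = self._path(kernel_id)
        if not path.exists():
            return None
        return json.loads(path.read_text())

    def delete(self, kernel_id):
        path = self._path(kernel_id)
        if path.exists():
            path.unlink()
```

Code that manages kernels would depend only on `KernelSessionStore`, so the backend could be chosen by configuration.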
Description
Recently I observed that EG cannot handle 30 concurrent kernel start requests. Here’s the itest code.
I set `LAUNCH_TIMEOUT` to 60 seconds and used kernelspecs already pulled on the node. In the case of the Spark kernel, the situation got worse because the `spark-submit` processes launched by EG cause process starvation between the EG process and the other `spark-submit` processes.
When I ran the test, CPU utilization rose above 90% (4-core, 8 GiB memory instance).
I know that there is HA work in progress, but it looks like an Active/Stand-by mode. With that approach we cannot scale EG out, only up. However, scaling up always has a limit: we cannot grow an instance beyond the size of the node EG is running on.
For those reasons, I want to start improving the scalability of EG, and I would like your opinion on the following ideas. (Let me just assume that EG is running on k8s.)
Process starvation
- `spark-submit` pod and `launch-kubernetes` pod instead of launching processes. Using containers, isolate the `spark-submit` process from the EG instance.
- `submitter` pod. The submitter pod queues the requests from EG and launches processes with a limited process pool. This submitter pod is also scalable, even while EG is not scalable yet, because EG always passes the parameters for launching a process.
Session Persistence (duplicate of #562 "High Availability - session persistence on bare metal machines" and #594 "Implementing HA Active/Active with distributed file system")
With those two resolutions, I think we can scale out EG instances. Any advice would be appreciated.
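The submitter idea above (queue the requests, launch with a limited process pool) can be sketched roughly as follows. This is a minimal single-process illustration, not a proposed implementation: the `Submitter` class and its methods are hypothetical, and a trivial Python child process stands in for `spark-submit`.

```python
import queue
import subprocess
import sys
import threading


class Submitter:
    """Hypothetical submitter: queue launch requests, run at most pool_size at once."""

    def __init__(self, pool_size=2):
        self._requests = queue.Queue()
        self._lock = threading.Lock()
        self.results = {}  # request_id -> exit code (None if the launch failed)
        for _ in range(pool_size):
            # Each worker runs one child process at a time, so pool_size
            # bounds the number of concurrently running launches.
            threading.Thread(target=self._work, daemon=True).start()

    def enqueue(self, request_id, argv):
        """Accept a launch request; returns immediately."""
        self._requests.put((request_id, argv))

    def wait(self):
        """Block until every queued request has been processed."""
        self._requests.join()

    def _work(self):
        while True:
            request_id, argv = self._requests.get()
            try:
                proc = subprocess.run(argv, capture_output=True, text=True)
                code = proc.returncode
            except OSError:
                code = None  # the command could not be started
            with self._lock:
                self.results[request_id] = code
            self._requests.task_done()


if __name__ == "__main__":
    submitter = Submitter(pool_size=2)
    for i in range(30):
        # Stand-in command (assumption); a real submitter would run spark-submit.
        submitter.enqueue(i, [sys.executable, "-c", "pass"])
    submitter.wait()
    print(len(submitter.results), "launches completed")
```

With 30 queued requests and `pool_size=2`, only two child processes compete for CPU at any moment; the remaining requests wait in the queue instead of starving the gateway process.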
Thanks.
Environment
Enterprise Gateway Version 2.x with Asynchronous kernel start feature (#580)
Platform: Kubernetes
Others: nb2kg latest