
Process starvation in concurrent kernel starts, and scaling out JEG for scalability. #732

Open
esevan opened this issue Sep 10, 2019 · 5 comments

Comments

@esevan
Contributor

esevan commented Sep 10, 2019

Description

Recently I observed that EG cannot handle 30 concurrent kernel start requests. Here's the itest code.

    # unittest and Pool are stdlib; GatewayClient, LOG, and the KERNELSPECS
    # fixtures come from the EG integration-test harness.
    import unittest
    from multiprocessing import Pool

    def scale_test(kernelspec, example_code, _):
        """test function for scalability test"""
        res = True
        gateway_client = GatewayClient()

        kernel = None
        try:
            # scalability test is a kind of stress test, so expand launch_timeout to our service request timeout.
            kernel = gateway_client.start_kernel(kernelspec)
            if example_code:
                res = kernel.execute(example_code)
        finally:
            if kernel is not None:
                gateway_client.shutdown_kernel(kernel)

        return res

    class TestScale(unittest.TestCase):
        def _scale_test(self, spec, test_max_scale):
            LOG.info('Spawn {} {} kernels'.format(test_max_scale, spec))
            example_code = []
            example_code.append('print("Hello World")')

            with Pool(processes=test_max_scale) as pool:
                children = []
                for i in range(test_max_scale):
                    children.append(pool.apply_async(scale_test,
                                                     (self.KERNELSPECS[spec],
                                                      example_code,
                                                      i)))

                test_results = [child.get() for child in children]

            for result in test_results:
                self.assertRegex(result, "Hello World")

        def test_python3_scale_test(self):
            test_max_scale = int(self.TEST_MAX_PYTHON_SCALE)
            self._scale_test('python3', test_max_scale)

        def test_spark_python_scale_test(self):
            test_max_scale = int(self.TEST_MAX_SPARK_SCALE)
            self._scale_test('spark_python', test_max_scale)

I set LAUNCH_TIMEOUT to 60 seconds and used kernelspecs whose images were already pulled onto the node. With the Spark kernel the situation got worse, because the spark-submit processes launched by EG starve both the EG process and one another.

During the test, CPU utilization rose above 90% (on a 4-core, 8 GiB memory instance).

I know there is HA work in progress, but it appears to be Active/Stand-by mode. That approach lets us scale EG up, not out, and "scale up" is always limited: we cannot grow the instance beyond the size of the node EG is running on.

For those reasons, I want to start working on EG's scalability, and I'd like your opinion on the following ideas. (Let me assume that EG is running on k8s.)

  • Process starvation

    • In order to resolve process starvation in the EG instance, I have two ideas:
      • Spawn a spark-submit pod and a launch-kubernetes pod instead of launching local processes. Using containers isolates the spark-submit process from the EG instance.
      • Create a separate submitter pod. The submitter pod queues launch requests from EG and launches them through a limited process pool. This submitter pod is scalable even while EG is not, because EG only ever passes the parameters needed to launch a process.
  • Session Persistence (duplicate of High Availability - session persistence on bare metal machines #562, Implementing HA Active/Active with distributed file system #594 )

    • This is actually a delicate issue with a high chance of side effects. My idea is to move all session objects into an external in-memory DB such as Redis, so that whenever EG needs to access a session it reads the session information from Redis. I can't yet estimate how much of the source needs to change, as I haven't looked into the code, but I'm guessing the Session Manager and Kernel Manager classes are involved. Could anybody give me feedback on this?
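The submitter-pod idea above can be sketched as a bounded worker pool that serializes launch requests, so only a fixed number of spark-submit processes ever run at once. This is an illustrative sketch, not EG code; `BoundedSubmitter` and its method names are hypothetical.

```python
import queue
import threading

class BoundedSubmitter:
    """Hypothetical sketch of the submitter idea: launch requests are
    queued and executed by a fixed number of workers, so no more than
    max_workers launches (e.g. spark-submit invocations) run at once."""

    def __init__(self, max_workers=4):
        self._requests = queue.Queue()
        self._results = {}
        for _ in range(max_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, job_id, launch_fn, *args):
        # EG would only pass the parameters needed to launch a process.
        self._requests.put((job_id, launch_fn, args))

    def _worker(self):
        while True:
            job_id, launch_fn, args = self._requests.get()
            try:
                self._results[job_id] = launch_fn(*args)
            finally:
                self._requests.task_done()

    def wait_all(self):
        # Block until every queued launch has completed.
        self._requests.join()
        return dict(self._results)
```

A real submitter pod would replace `launch_fn` with a spark-submit invocation and expose the queue over the network (e.g. via Celery), but the back-pressure mechanism is the same.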

With those two changes, I think we can scale out EG instances. Any advice would be appreciated.

Thanks.

Environment

Enterprise Gateway Version 2.x with Asynchronous kernel start feature (#580)
Platform: Kubernetes
Others: nb2kg latest

@kevin-bates
Member

Thanks Evan. I'm cross-referencing the discourse link here: https://discourse.jupyter.org/t/scalable-enterprise-gateway/2014

@esevan
Contributor Author

esevan commented Sep 19, 2019

I'm working on the Submitter idea using Celery as a PoC. This means the PoC introduces additional microservice dependencies, i.e. Redis and Celery (I prefer Redis to RabbitMQ as the Celery message broker because it can also serve as a persistent session store 😁).

My idea is to replace jupyter_client.launch_kernel with remote_launcher.launch_kernel, which returns a RemoteLauncherProcess supporting poll, kill, send_signal, and wait, just like the subprocess.Popen class. The remote_launcher can have many implementations.

So, to start, I'm implementing celery_launcher and CeleryLauncherProcess.
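The Popen-compatible surface described above might look like the following sketch. Only the method set (poll, wait, send_signal, kill) comes from subprocess.Popen; the class names and the trivial in-memory backend are hypothetical.

```python
import abc
import signal

class RemoteLauncherProcess(abc.ABC):
    """Sketch of a Popen-like handle for a remotely launched kernel."""

    @abc.abstractmethod
    def poll(self):
        """Return the exit code, or None if still running."""

    @abc.abstractmethod
    def wait(self, timeout=None):
        """Block until the process exits; return the exit code."""

    @abc.abstractmethod
    def send_signal(self, signum):
        """Deliver a signal to the remote process."""

    def kill(self):
        # POSIX-only convenience, mirroring Popen.kill()
        self.send_signal(signal.SIGKILL)

class InMemoryLauncherProcess(RemoteLauncherProcess):
    """Trivial stand-in backend, useful for unit tests; a real
    CeleryLauncherProcess would forward these calls to a worker."""

    def __init__(self):
        self._returncode = None

    def poll(self):
        return self._returncode

    def wait(self, timeout=None):
        return self._returncode

    def send_signal(self, signum):
        # Popen reports death-by-signal as a negative return code.
        self._returncode = -signum
```

Keeping the interface identical to Popen means the callers that currently consume jupyter_client.launch_kernel's return value would not need to change.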

It may be far off, or turn out to be a different issue entirely, but I'm guessing we could make "kernel start" lazy by returning a job id to nb2kg before the kernel is connected.

@kevin-bates Could you please comment on the lazy kernel start idea? nb2kg would trick the frontend into thinking the kernel is already connected while it keeps polling EG for the kernel's status. Once EG connects to the kernel, nb2kg lazily connects as well and reports the real kernel status ("idle") to the frontend.

@kevin-bates
Member

@esevan - thanks for looking into this. Got a couple comments.

  1. We can't modify nb2kg since that is tantamount to a protocol change - in this case, the job id. EG exposes a REST API and we must view any client-side interactions relative to that. There are many users of EG that do not use nb2kg, so features that trickle into the client are difficult to justify.
  2. Your process launch abstraction is exactly what the process proxies are all about. They just happen to use jupyter_client.launch_kernel. I would prefer we not introduce another abstract layer - but look at extending the process proxies.
  3. We need to keep in mind that there's a high likelihood we'll be moving to the kernel providers architecture at some point (see the kernel-providers in this org - they should look very familiar 😄). When that time comes, functionality in EG will be decoupled from kernel lifecycle management. Some of this could be done in the RemoteKernelProvider (which is essentially the RemoteKernelManager/RemoteProcessProxy code), but it must be completely self-contained. (In this env, things like Notebook could run remote kernels w/o EG.)
  4. I have no objections to celery, redis, etc. However, I want us to try to adopt third-party applications that might already have been adopted (and vetted) by other projects in the Jupyter community. I see celery also supports MongoDB, and Mongo is one I think has been used elsewhere. I know that hub uses SQLite as its default store and Notebook has SQLite code in it, so I don't know if that would be helpful. I'm just saying we should consider what existing Jupyter installations might already be using so we don't end up with a bunch of different NoSQL DBs, etc. in use by default.
  5. If at all possible, a thin abstraction layer over whatever third-party is in use might go a long way in allowing others to create a similar shim and bring their own application.

@esevan
Contributor Author

esevan commented Sep 20, 2019

@kevin-bates Thank you for the detail!

  1. I easily forget that nb2kg is just one of EG's clients. Yes, I agree that we cannot modify nb2kg :(
  2. Right, good point! I don't have to introduce new classes. I'll extend ProcessProxy 👍
  3. This is great. Do you have any videos or documents about these projects? I'll stay tuned 👍
  4. Understood. After the PoC (monitoring how much the process starvation is relieved), I'll start on a module without dependencies on third-party apps.
  5. I agree with the "thin abstraction layer" approach. I'll go ahead with a thin abstraction layer extending the existing ProcessProxy. Thanks!

@lresende
Member

As for the persistence layer, I would also recommend defining an abstraction layer so one can choose between Mongo or any other data store they use/need.
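A minimal version of that persistence abstraction might look like the sketch below. `SessionStore` and its method names are hypothetical, and the dict-backed implementation merely stands in for a Redis, Mongo, or SQLite backend.

```python
import abc
import json

class SessionStore(abc.ABC):
    """Hypothetical persistence abstraction: backends (Redis, MongoDB,
    SQLite, ...) subclass this so EG code never touches a concrete store."""

    @abc.abstractmethod
    def save(self, kernel_id, session):
        """Persist the session dict under kernel_id."""

    @abc.abstractmethod
    def load(self, kernel_id):
        """Return the session dict, or None if absent."""

    @abc.abstractmethod
    def delete(self, kernel_id):
        """Remove the session, if present."""

class InMemorySessionStore(SessionStore):
    """Dict-backed reference backend; JSON round-tripping mimics
    string-valued stores such as Redis."""

    def __init__(self):
        self._data = {}

    def save(self, kernel_id, session):
        self._data[kernel_id] = json.dumps(session)

    def load(self, kernel_id):
        raw = self._data.get(kernel_id)
        return json.loads(raw) if raw is not None else None

    def delete(self, kernel_id):
        self._data.pop(kernel_id, None)
```

Anyone bringing their own data store would only need to implement these three methods against their backend of choice.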
