
Propose a new architecture with focus on scalability and robustness #360

Merged 2 commits into kubeflow:master on Jun 21, 2021

Conversation

alculquicondor
Collaborator

Fixes #315

- With the above changes, **the following objects can be removed**:
- The ServiceAccount+Role+RoleBinding for the launcher.
- The `kubectl-delivery` init container in the launcher, as there is no need
to obtain IPs, speeding up startup time.
Member

Do you mean the init container kubectl-delivery will be removed? If so, how could we make sure the launcher starts after the workers?

Collaborator Author
@alculquicondor May 18, 2021

There are 2 options:

  1. Do nothing and leave such handling to the gang scheduler. For this reason, it is important that the launcher can do retries.
  2. Don't create the launcher until all the worker pods are running. Then everything is handled by the controller, with no need for extra cache syncs.

I prefer option 1.

Member

I don't think so. You want to speed up startup time, but you don't get that if the launcher has to retry on failures. And the scheduler should not know about the job startup strategy.

Member

At the same time, how do you determine the real reason the launcher failed with option 1? For example:

  • the workers do not start up, so the launcher fails and restarts
  • the training job fails, so the launcher fails.

Collaborator Author
@alculquicondor May 19, 2021

What do you mean by "job startup strategy"?

As for differentiating failures, I don't think we can or should do that. There is another type of failure which is kind of a mix of the two you described: a worker pod gets evicted by kubelet after the job already started. We cannot easily differentiate this one from a case where the user's code had a crash, for the purpose of retrying.

But perhaps option 2 is reasonable. WDYT?


If we want to support retries, then as a by-product we want option 1, correct?

Collaborator Author

We can still use option 2. It might make startups less flaky, but it's not a guarantee anyway: a worker can simply fail between the time it reported running and the time the launcher started running.

and has to be paid for every job. The API calls also cause additional
stress on the apiserver as the number of launchers increases.
- The launcher pod has execution permissions on any other Pod in the
namespace, which can be a security concern.
Member
@carmark May 18, 2021

The launcher only has exec permission on the worker pods belonging to its job; you can find it in the code.

Collaborator Author

Ah, I missed that. However, that doesn't scale as the number of workers increases (k8s objects have a size limit).
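
For reference, this is roughly the shape of such a per-job Role, assuming pods/exec is restricted through resourceNames (the job and pod names below are illustrative, not the operator's actual output); the resourceNames list grows with every worker, which is what eventually runs into the object size limit:

```yaml
# Sketch of a per-job launcher Role, assuming pods/exec is restricted
# via resourceNames. Job and pod names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-mpi-job-launcher
rules:
- apiGroups: [""]
  resources: ["pods/exec"]
  # One entry per worker pod: the list grows linearly with the number
  # of workers, pushing the Role object toward the k8s size limit.
  resourceNames:
  - my-mpi-job-worker-0
  - my-mpi-job-worker-1
  - my-mpi-job-worker-2
  verbs: ["create"]
```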

Collaborator Author

Updated.

Member

If we replace kubectl exec with ssh, we can also support the elastic feature, so I do not think this is a problem for us, if I understand correctly.

stress on the apiserver as the number of launchers increases.
- The launcher pod has execution permissions on any other Pod in the
namespace, which can be a security concern.
- The v1 controller doesn’t implement launcher pod retries, although there are
Member

What are the use cases for launcher pod retries?


Running jobs on spot/preemptible VMs.

Member

From my perspective, I don't think the launcher pod should be scheduled on spot/preemptible VMs. For the worker pods, yes, because worker pods are stateless. However, the launcher pod seems to be a stateful instance, whose restart means a new job.

Collaborator Author

Still, when running the workers on preemptible VMs, if any of them fails, the entire job fails, including the launcher. So we need launcher retries.

Member

@alculquicondor What MPI implementation are you expecting users to use to make sure that, after the launcher gets restarted, the workers are aware and can reconnect to the new launcher?

Collaborator Author
@alculquicondor May 18, 2021

The process goes like this:

  1. A worker terminates unexpectedly.
  2. The launcher notices the worker failure, closes the rest of the ssh connections, and terminates with a failure.
  3. The launcher restarts and opens new ssh connections to the workers.

This is independent of the MPI implementation.
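
As an illustration of how the restart in step 3 can be realized (a minimal sketch only; the pod name, image, and command are assumptions, not the operator's actual output), the launcher pod can rely on kubelet restarts through its restart policy:

```yaml
# Minimal sketch of a launcher pod that is restarted in place by kubelet.
# Names, image, and command are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: my-mpi-job-launcher
spec:
  # OnFailure makes kubelet restart the launcher container when mpirun
  # exits with an error (e.g. after a worker was preempted); the new
  # process then opens fresh ssh connections to the workers.
  restartPolicy: OnFailure
  containers:
  - name: launcher
    image: my-mpi-image                      # illustrative
    command: ["mpirun", "-n", "20", "/opt/app/main"]
```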

Member

> Still, when running the workers on preemptible VMs, if any of them fails, the entire job fails, including the launcher. So we need launcher retries.

From what I have observed in other HPC systems, like Slurm, for jobs that are not fault-tolerant, when some of the workers fail, the system does not retry via a launcher restart. Instead, the entire job is marked as failed and its resources are released. The system re-queues the failed job, creating a new one, if a retry is requested by the user. Such a process is compatible with the contemporary design of mpi-operator.

Collaborator Author

OTOH, that kind of goes against the declarative and fault-tolerant approach of k8s APIs, including the Job API. Retries are what Kubernetes users would expect, and if they don't need them, they can always disable them.


I think it would be useful to make automatic retries an option that defaults to 0.
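
For illustration only (the apiVersion, field name, and placement are assumptions, not a settled API), such an option could be surfaced as a backoff limit on the MPIJob spec, defaulting to 0:

```yaml
# Hypothetical sketch of exposing launcher retries on the MPIJob spec.
# The apiVersion, field names, and values are illustrative only.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
spec:
  runPolicy:
    backoffLimit: 0        # default: no automatic launcher retries
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
    Worker:
      replicas: 20
```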

@terrytangyuan changed the title from "Propose a new architecture with focus con scalability and robustness" to "Propose a new architecture with focus on scalability and robustness" on May 18, 2021
Collaborator Author
@alculquicondor left a comment

Thanks for your comments so far. Please let me know if I missed anything else in the background. I got all the details by reading the code.

and has to be paid for every job. The API calls also cause additional
stress on the apiserver as the number of launchers increases.
- The launcher pod has execution permissions on any other Pod in the
namespace, which can be a security concern.
Collaborator Author

Updated.


```yaml
apiVersion: apps/v1
kind: StatefulSet
```
Member

I am wondering how to support elastic mode with a StatefulSet. We may add/remove workers on the fly.

Collaborator Author

You just change .spec.replicas.
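
As a minimal sketch of what that means (object names and image are illustrative, not the operator's actual output), elasticity amounts to the controller updating a single field of the worker StatefulSet while the remaining pods keep their stable hostnames:

```yaml
# Sketch of the worker StatefulSet that elastic mode would resize.
# Object names and image are illustrative.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-mpi-job-worker
spec:
  replicas: 20                        # scaling up/down only changes this field
  serviceName: my-mpi-job-worker      # headless Service giving stable hostnames
  selector:
    matchLabels:
      app: my-mpi-job-worker
  template:
    metadata:
      labels:
        app: my-mpi-job-worker
    spec:
      containers:
      - name: worker
        image: my-mpi-image           # illustrative
```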

@alculquicondor
Collaborator Author

There are 2 main open discussions pending:

Please leave your thoughts.

@alculquicondor
Collaborator Author

I asked for 10 minutes at the next AutoML and Training WG meeting. Would you be able to make it, @terrytangyuan @gaocegege?

@terrytangyuan
Member

I cannot make it due to scheduling conflicts. In the meantime, I'd encourage all existing reviewers to check if your comments can be resolved or discuss if there are further concerns on the proposal (which might be more efficient as we are located in more than 4 timezones).

@ahg-g

ahg-g commented Jun 1, 2021

> I cannot make it due to scheduling conflicts. In the meantime, I'd encourage all existing reviewers to check if your comments can be resolved or discuss if there are further concerns on the proposal (which might be more efficient as we are located in more than 4 timezones).

+1. Can we also assign an "approver"?

@terrytangyuan
Member

We should try to reach consensus from people listed in OWNERS. People from @kubeflow/wg-training-leads can also approve after that.

@alculquicondor
Collaborator Author

@rongou any thoughts?

@gaocegege
Member

@carmark @zw0610 @Jeffwan PTAL thanks.

@alculquicondor
Collaborator Author

One topic that was raised at today's meeting was where to place the code. This question is important because there are breaking changes and the new controller wouldn't be able to properly process an existing job.

I was thinking that we can just modify the v1 code, as it hasn't been released, AFAIK. Is this not the case? Should we create a new folder, like v1-ssh or something like that? My worry is that this would become a maintenance problem, as changes would now have to be applied in 2 places. And long term, the existing v1 code should be removed. But if v1 was never released, it should be safe to just replace it.

@terrytangyuan
Member

It's true that v1 is not officially released yet, but I believe there are several companies that are already running it in production, which might be concerning.

@alculquicondor
Collaborator Author

I see. Then maybe we can create the new controller in an entirely new module, so that we can start with fresh dependencies and not be blocked by kube-batch's (see #364).

> We should try to reach consensus from people listed in OWNERS. People from @kubeflow/wg-training-leads can also approve after that.

If they don't respond, can we assume lazy consensus?

@zw0610
Member

zw0610 commented Jun 5, 2021

> It's true that v1 is not officially released yet, but I believe there are several companies that are already running it in production, which might be concerning.

I believe the API versions v1, v1alpha1, and v1alpha2 are not necessarily bound to the controller version. This means that if this proposal wishes to add a new kind of controller for the v1 API, all the contributors need to do is add the implementation under pkg/controllers/v1, but in a different file such as pkg/controllers/v1/mpi_job_alternative_controller.go. After the implementation is done, you can add an option in cmd/mpi-operator.v1 to let users choose which controller to use. In this way, we can avoid breaking users who are already using the v1 API and the existing controller implementation.

Also, as attendees suggested at the Kubeflow training meeting, it would be better to present metrics showing that the new design does improve scalability and launch time before this proposal is adopted.

@alculquicondor
Collaborator Author

The problem is not necessarily the API. Let's say someone has running jobs with the existing v1 controller. If they upgrade to the new proposed controller, the existing jobs would have orphan resources that the new controller won't manage. And the existing workers wouldn't work, because they don't have stable hostnames.

@alculquicondor
Collaborator Author

In any case, I think the decision of where to place the code comes later. Can we have an agreement on the direction of the new architecture? See #360 (comment)

@terrytangyuan
Member

> Also, as attendees suggested at the Kubeflow training meeting, it would be better to present metrics showing that the new design does improve scalability and launch time before this proposal is adopted.

I think @zw0610 raised a good point here. The proposed changes are very significant. I think in this case it makes sense to present evidence of the improvements (perhaps benchmarking on a fork) before we merge it back to the upstream repo. This way we are all confident that the changes are beneficial to the community.

@alculquicondor I think you've put together a good starting point in the "analysis" section of the proposal. It would be great to see some real benchmarks/metrics so it's more convincing in practice rather than just in theory.

@alculquicondor
Collaborator Author

Thanks, I'll work on that throughout the week.

@alculquicondor
Collaborator Author

I ran some experiments with these characteristics:

  • 21 nodes
  • 600 running pods (not part of the job)
  • Job with 3 slots per worker
  • All nodes already have the image (so no image pull time involved)

Then I ran two jobs, one with 3 workers and one with 20 workers. Each task does the following:

  1. Print its rank
  2. Execute a simple calculation
  3. Do an MPI_Reduce targeting rank 0

So really, most of the time is spent on setup, but some communication is being tested too.

For each run, I recorded the Start time and Completion time as reported by the mpi-operator controller.

For 3 workers, I got an average of 9s for the current operator and 3s for the ssh proposal.
For 20 workers, I got an average of 19s for the current operator and 11s for the ssh proposal.

All the individual run timings were shared as a screenshot (not reproduced here).

You can see a significant improvement. I used a reasonably small cluster and somewhat small jobs. Let me know if you would like more details about my quick experiments.

However, let me reiterate the theoretical analysis already presented, which I think is important on its own. The apiserver is a critical component of the cluster. We should avoid stressing it with requests whose purpose is pod-to-pod communication. The apiserver has limits on open connections and it throttles API requests. With the current architecture, just a few dozen jobs could starve API requests from critical components such as the scheduler or the controller manager.

@alculquicondor
Collaborator Author

@terrytangyuan @zw0610 anything to add?

@terrytangyuan
Member

terrytangyuan commented Jun 18, 2021

Thanks for putting this together! I think both the theoretical and the measured improvements look great to me.

As it's already close to the weekend, please leave additional comments by June 22nd, and I will merge this if there are no major concerns by then.

@johnugeorge
Member

I feel that this proposal makes a strong case for a revamp, based on #360 (comment).

/lgtm

@gaocegege
Member

/lgtm

Thanks for the proposal.

@terrytangyuan
Member

/approve

@google-oss-robot google-oss-robot merged commit 39d2108 into kubeflow:master Jun 21, 2021
@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan


Successfully merging this pull request may close these issues.

[Discussion] Switch to use Job API with ArrayJob semantics