Propose a new architecture with focus on scalability and robustness #360

Merged (2 commits) on Jun 21, 2021

284 changes: 284 additions & 0 deletions proposals/scalable-robust-operator.md

# A scalable and robust operator

Authors: @alculquicondor, @ahg-g

- [Motivation](#motivation)
- [Goals](#goals)
- [Background](#background)
- [Design](#design)
- [Alternatives Considered](#alternatives-considered)
- [Appendix](#appendix-prototype-objects)

## Motivation

A scalable MPI setup on Kubernetes is important for:
- Running jobs from multiple users in a single cluster.
- Running high-performance computing jobs that can require a large number of workers.

A robust MPI setup should be tolerant of failures, implementing retries while
keeping track of them.

## Goals

- Allow driver-to-worker control to scale by removing kube-apiserver from
the communication channel.
- Reduce the complexity of the controller by relying on Kubernetes workload
APIs for Pod creation and management.

## Background

The original design of the operator can be found as a
[proposal to the kubeflow community](https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md).
The latest release includes a v1alpha2 API and controller.
A v1 is under development. Something to highlight in this new version is the
replacement of the Job and StatefulSet by plain Pods, with the intent of
[tracking running Pods](https://github.com/kubeflow/mpi-operator/issues/201#issuecomment-827837831).

An MPIJob CRD describes the Job. Important fields include:
- The workers template
- The number of workers
- The launcher template, which should have an `mpirun` command.

The images are expected to have the MPI implementation binaries (such as
OpenMPI, MPICH or Intel MPI) and the user’s MPI executable.
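
For reference, a minimal MPIJob manifest with these fields might look roughly
like the sketch below. This is only an illustrative sketch, assuming the
`kubeflow.org/v1` API group and the `mpiReplicaSpecs`/`slotsPerWorker` field
names; the image and executable are placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: example-mpijob
spec:
  slotsPerWorker: 2           # slots per worker, e.g. number of CPUs or GPUs
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: '<USER_IMAGE>'
            command: ['mpirun', '-np', '4', '<USER_EXECUTABLE>']   # 2 workers x 2 slots
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: '<USER_IMAGE>'
```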

A controller processes the MPIJob, starting a Job with the following steps:
1. Creates a ConfigMap, which contains:
- A script `kubexec.sh` that wraps `kubectl exec` and is used in place of
`ssh`. This script, before executing the command provided by `mpirun`,
transfers a file containing a mapping of pod names to IPs and appends it to
the worker’s `/etc/hosts`.

Note: The v1 controller no longer copies the pod-to-IP mapping file. The
OpenMPI implementation does the [routing](https://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3).
- The `hostfile` for `mpirun`, listing the worker pod names and number of
slots (which could be the number of CPUs or GPUs). This list is built
programmatically.
2. Creates a ServiceAccount+Role+RoleBinding for the launcher, which allows it
to:
- get/list/watch on Pods
- do pods/exec
This allows the launcher Pod to obtain details of the worker Pods and start
the process managers on them (see the Role sketch after this list).
3. If configured, creates a Volcano PodGroup.
4. Creates a StatefulSet for workers (plain pods in v1). The Pod template
includes:
- mounts for the ConfigMap.
- `sleep` as the command.
5. Creates the launcher Job (a plain Pod in v1). The Pod template includes:
- An init container, `kubectl-delivery`, described below.
- Environment variables for:
- replacing `ssh` with `kubexec.sh`
- the `hostfile` location
- Volumes for:
- The ConfigMap
- Sharing files from `kubectl-delivery`
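
As an illustration of step 2, the launcher's Role could look roughly like the
minimal sketch below. The object name is a placeholder and the exact rules are
defined by the controller, not by this document:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: mpi-launcher        # placeholder name
rules:
# Read access to Pods, so the launcher can find and watch the workers.
- apiGroups: ['']
  resources: ['pods']
  verbs: ['get', 'list', 'watch']
# `kubectl exec` creates a pods/exec subresource, hence the `create` verb.
- apiGroups: ['']
  resources: ['pods/exec']
  verbs: ['create']
```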

The launcher Job, as previously mentioned, contains an init container:
`kubectl-delivery`. This is a Kubernetes controller that watches pods. It does
the following:

1. Copies `kubectl` from the image into the volume shared with the main container.
2. Waits for all Pods in the hostfile to be running.
3. Generates a file mapping pod names to IPs, in `/etc/hosts` format.

To update the status of an MPIJob, the controller uses the status of the
launcher Job. That is, when the launcher Job fails or succeeds, its status is
copied to the MPIJob. In v1, it bases the status on the termination condition of
the Pod.

### Analysis

The above architecture for MPI Jobs has the following problems:

- It doesn’t scale as we increase the number of workers, belonging to the same
or different jobs.
- Due to the use of `kubectl exec`, every worker spawn goes through
kube-apiserver. `mpirun` starts a daemon in each worker
(like [`orted`](https://www.open-mpi.org/doc/v3.0/man1/orted.1.php)).
This process handles the worker-to-worker communication, which happens
without the intervention of kube-apiserver. However, the `exec` connection
stays up for control during the entirety of the job.
- The `kubectl-delivery` controller does a full cache sync to be able to watch
Pods. This startup penalty increases with the number of pods in the cluster
and has to be paid for every job. The API calls also cause additional
stress on the apiserver as the number of launchers increases.
- The launcher pod has execution permissions on any other Pod in the
namespace, which can be a security concern.

> **@carmark (Member), May 18, 2021:** The launcher only has execution permission on the worker pods belonging to its job; you can find it in the code.
>
> **@alculquicondor (Author):** Ah, missed that. However, that doesn't scale as the number of workers increases (k8s objects have a size limit).
>
> **@alculquicondor (Author):** Updated.
>
> **Member:** If we replace `kubectl exec` with `ssh`, we can also support the elastic feature, so I do not think it is a problem for us, if I understand it correctly.

- The v1 controller doesn’t implement launcher pod retries, although there are
plans to do so. So the MPIJob behaves like a plain Pod in this version.

> **Member:** What are the use cases for launcher pod retry?
>
> **Reviewer:** Running jobs on spot/preemptible VMs.
>
> **Member:** From my perspective, I don't think the launcher pod should be scheduled to spot/preemptible VMs. For the worker pods, yes, because worker pods are stateless. However, the launcher pod seems to be a stateful instance, whose restart means a new job.
>
> **@alculquicondor (Author):** Still, when running the workers in a preemptible VM, if any of them fail, the entire job fails, including the launcher. So we need launcher retries.
>
> **Member:** @alculquicondor What MPI implementation are you expecting users to use after the launcher gets restarted, to make sure the workers are aware and can reconnect to the new launcher?
>
> **@alculquicondor (Author), May 18, 2021:** The process goes like this:
> 1. A worker terminates unexpectedly.
> 2. The launcher notices the worker failure, closes the rest of the ssh connections and terminates with a failure.
> 3. The launcher restarts and launches new ssh connections to the workers.
>
> This is independent of the MPI implementation.
>
> **Member:** From what I have observed in other HPC systems, like Slurm, for jobs that are not fault-tolerant, when some of the workers fail, the system does not retry via a launcher restart. Instead, the entire job is marked as failed and its resources are released. The system re-queues the failed job if a retry is demanded by the user and creates a new job. Such a process is compatible with the contemporary design of mpi-operator.
>
> **@alculquicondor (Author):** OTOH, it kind of goes against the declarative and fault-tolerant approach of k8s APIs, including the Job API. Retries are what Kubernetes users would expect, and if they don't need them, they could always disable them.
>
> **Reviewer:** I think it is useful if we make automatic retries an option that defaults to 0.

## Design

In order to address the problems of the existing architecture, we propose the
following changes:

- **The use of `ssh` instead of `kubectl exec`.**

This would avoid any pod-to-pod communication going through the apiserver and
doesn’t require giving execution permissions at the namespace level.
This can be achieved as follows:
- The controller can generate a single key and share it with the launcher and
worker pods through a Secret.
- The launcher and workers mount the Secret and set appropriate file
permissions.
- The workers run an SSH server instead of `sleep`.
- When encrypted communication is not a requirement, users could have the
choice to use `rsh` for faster communication.

- **The use of stable hostnames and a headless Service for the workers**
- This removes the need to query Pod IPs, as the Pods can discover each other
through DNS resolution. The hostfile can be generated statically by the
controller using the stable hostnames.
- Starting with k8s 1.22, we can use
[Indexed Jobs with stable hostnames](https://git.k8s.io/enhancements/keps/sig-apps/2214-indexed-job)
to delegate the pod management to Kubernetes. Additionally, this will give
us robust failure tracking so that we can give users control over retry
limits.
In the meantime, we can continue using plain Pods.
- Caveat 1: The Job controller doesn’t report the number of running Pods.
Instead, it reports active Pods, which include running and pending
(scheduling or starting). But Kubernetes SIG Apps is
[open to add a status field for running Pods](https://kubernetes.slack.com/archives/C18NZM5K9/p1619549430067400).
- Caveat 2: Horovod supports elastic workers, but the Kubernetes Job
doesn’t support changes to the completions field. This can be supported
starting from 1.23. In the meantime, we can replicate the behavior by
creating a new Job and doing Pod adoption.
- For Intel MPI, we also need a headless Service to front the launcher.
- **Revert to using the Job API for the launcher.**
- The Job controller handles retries when the launcher or any of the workers fail.
- Caveat 1 also applies: The Job controller doesn’t report if the Pod is running.
We can continue watching Pods in the meantime.
- With the above changes, **the following objects can be removed**:
- The ServiceAccount+Role+RoleBinding for the launcher.
- The `kubectl-delivery` init container in the launcher, as there is no need
to obtain IPs, speeding up startup time.

> **Member:** Do you mean the init container `kubectl-delivery` will be removed? If so, how could we make the launcher start after the workers?
>
> **@alculquicondor (Author), May 18, 2021:** There are 2 options:
> 1. Do nothing, and leave such handling to the gang scheduler. For this reason it is important that the launcher can do retries.
> 2. Do not create the launcher until all the worker pods are running, so everything is handled by the controller and there is no need for extra cache syncs.
>
> I prefer option 1.
>
> **Member:** I don't think so. You would like to speed up startup time, but it doesn't if you want the launcher to retry on failures. And the scheduler should not know about the job startup strategy.
>
> **Member:** At the same time, how do you verify the real reason the launcher failed with option 1? For example:
> - the workers do not start up, so the launcher fails and restarts
> - the training job fails, so the launcher fails
>
> **@alculquicondor (Author), May 19, 2021:** What do you mean by "job startup strategy"?
>
> As for differentiating failures, I don't think we can or should do that. There is another type of failure which is kind of a mix of the two you described: a worker pod gets evicted by the kubelet after the job already started. We cannot easily differentiate this one from a case where the user's code had a crash, for the purpose of retrying.
>
> But perhaps option 2 is reasonable. WDYT?
>
> **Reviewer:** If we want to support retries, then as a by-product we want option 1, correct?
>
> **@alculquicondor (Author):** We can still use option 2. It might make startups less flaky, but it's not a guarantee anyway. A worker can simply fail between the time it reported running and the time the launcher started running.
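
To make the proposed changes concrete, the sketch below shows how the workers
could eventually be expressed as an Indexed Job behind the headless Service,
running an SSH server instead of `sleep`. This is only an illustration under
the assumptions above (Kubernetes 1.22+ Indexed Jobs with stable hostnames);
the `sshd` command, mount paths and names are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-workers
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed            # workers get stable hostnames mpi-workers-0, -1, -2
  template:
    spec:
      subdomain: mpi-workers         # matches the headless Service, so hostnames resolve via DNS
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: '<USER_IMAGE>'
        command: ['/usr/sbin/sshd', '-De']   # SSH server instead of `sleep`
        volumeMounts:
        - name: ssh-auth
          mountPath: /mnt/ssh
          readOnly: true
      volumes:
      - name: ssh-auth
        secret:
          secretName: ssh-auth
```

The appendix below shows the equivalent prototype using a StatefulSet, since
Indexed Jobs are still an alpha feature.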


## Alternatives Considered

TBD from discussions

## Appendix: Prototype objects

It uses a StatefulSet in place of Indexed Jobs, as they are still an alpha
feature in Kubernetes.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mpi-config
data:
  hostfile: |
    mpi-workers-0.mpi-workers slots=3
    mpi-workers-1.mpi-workers slots=3
    mpi-workers-2.mpi-workers slots=3
```

```yaml
apiVersion: v1
kind: Secret
type: kubernetes.io/ssh-auth
data:
  ssh-privatekey: PRIVATE_KEY
  ssh-publickey: PUBLIC_KEY
```

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-launcher
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: driver
        image: '<USER_IMAGE>'
        args:
        - 'mpirun'
        - '-np'
        - '9'
        - '<USER_EXECUTABLE>'
        env:
        - name: 'OMPI_MCA_orte_keep_fqdn_hostnames'
          value: 'true'
        - name: 'OMPI_MCA_orte_default_hostfile'
          value: '/home/mpiuser/config/hostfile'
        volumeMounts:
        - name: ssh-auth
          mountPath: /mnt/ssh
          readOnly: true
        - name: mpi-config
          mountPath: /home/mpiuser/config
          readOnly: true
      volumes:
      - name: mpi-config
        configMap:
          name: mpi-config
      - name: ssh-auth
        secret:
          secretName: ssh-auth
          items:
          - key: ssh-privatekey
            path: id_rsa
          - key: ssh-publickey
            path: id_rsa.pub
          - key: ssh-publickey
            path: authorized_keys
```

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mpi-workers
spec:
  clusterIP: None
  selector:
    app: mpi-workers
```

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mpi-workers
spec:
  selector:
    matchLabels:
      app: mpi-workers
  serviceName: mpi-workers
  replicas: 3
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: mpi-workers
    spec:
      containers:
      - name: worker
        image: '<USER_IMAGE>'
        volumeMounts:
        - name: ssh-auth
          mountPath: /mnt/ssh
          readOnly: true
        - name: mpi-config
          mountPath: /home/mpiuser/config
          readOnly: true
      volumes:
      - name: mpi-config
        configMap:
          name: mpi-config
      - name: ssh-auth
        secret:
          secretName: ssh-auth
          items:
          - key: ssh-privatekey
            path: id_rsa
          - key: ssh-publickey
            path: id_rsa.pub
          - key: ssh-publickey
            path: authorized_keys
```

> **Member:** I am wondering how to support elastic mode with a StatefulSet? We may add/remove workers on the fly.
>
> **@alculquicondor (Author):** You just change `.spec.replicas`.