Add support for Intel MPI #389

Merged (2 commits) on Aug 3, 2021
Conversation

@alculquicondor (Collaborator) commented Jul 28, 2021:

  • Adds the field .spec.mpiImplementation, which defaults to OpenMPI (sketched below).
  • The Intel implementation requires a Service fronting the launcher.
  • Passes the number of slots through an environment variable instead of the hostfile (some versions of Intel MPI ignore the slots defined in the hostfile).
  • Adds an example that uses Intel MPI.
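
For illustration, a minimal sketch of what the API change could look like; the package, type, and constant names here are assumptions, not necessarily the merged code:

// Sketch only: package, type, and constant names are assumed,
// not taken verbatim from the merged PR.
package v2beta1

// MPIImplementation names the MPI flavor the operator drives.
type MPIImplementation string

const (
    MPIImplementationOpenMPI MPIImplementation = "OpenMPI"
    MPIImplementationIntel   MPIImplementation = "Intel"
)

type MPIJobSpec struct {
    // MPIImplementation selects the MPI implementation;
    // the operator defaults it to OpenMPI when unset.
    MPIImplementation MPIImplementation `json:"mpiImplementation,omitempty"`
    // ...other fields elided...
}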

Intel MPI is very flaky at startup, as opposed to OpenMPI. In particular, it won't retry connections to workers if their hostnames are not resolvable. The entrypoint handles this.
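
The real entrypoint is a shell script; purely as an illustration of that retry idea, here is a Go-style sketch (the function name and timeout are made up):

// Hypothetical helper illustrating the entrypoint's retry behavior;
// the real logic lives in a shell script.
package main

import (
    "fmt"
    "net"
    "time"
)

// waitResolvable polls DNS until host resolves or the timeout elapses.
func waitResolvable(host string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        if _, err := net.LookupHost(host); err == nil {
            return nil // host is resolvable; safe to connect to it
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("%s did not resolve within %v", host, timeout)
        }
        time.Sleep(time.Second)
    }
}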

@terrytangyuan (Member) left a comment:

Others who have experience with this, please help review this one.

@alculquicondor (Collaborator, Author):

cc @kawych

@gaocegege (Member):

cc @zidarko

@kawych left a comment:

LGTM. It would be great if you could document which Intel MPI versions this was tested with.

-func (c *MPIJobController) getOrCreateWorkersService(mpiJob *kubeflow.MPIJob) (*corev1.Service, error) {
-    svc, err := c.serviceLister.Services(mpiJob.Namespace).Get(mpiJob.Name + workerSuffix)
+func (c *MPIJobController) getOrCreateWorkersService(job *kubeflow.MPIJob) (*corev1.Service, error) {
+    return c.getOrCreateService(job, newWorkersService)

Can you simplify the code by creating the service here and removing the newWorkersService() factory function? And the same with the launcher?

@alculquicondor (Collaborator, Author) replied:

I'm getting rid of the factory in the getOrCreateService function. But I have to keep the newWorkersService because it's used for tests.
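
Judging from the diff above, the shared helper presumably takes the factory as a parameter. A rough sketch follows; the exact signature, the kubeClient field, and the imports (context, apierrors "k8s.io/apimachinery/pkg/api/errors", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1") are assumptions, not the merged code:

// Sketch inferred from the diff above, not the merged implementation.
func (c *MPIJobController) getOrCreateService(job *kubeflow.MPIJob, newSvc func(*kubeflow.MPIJob) *corev1.Service) (*corev1.Service, error) {
    want := newSvc(job)
    svc, err := c.serviceLister.Services(job.Namespace).Get(want.Name)
    if apierrors.IsNotFound(err) {
        // Not in the informer cache yet: create it through the API server.
        return c.kubeClient.CoreV1().Services(job.Namespace).Create(context.TODO(), want, metav1.CreateOptions{})
    }
    return svc, err
}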

source $set_intel_vars
fi

function resolve_host() {

I think this can still cause issues sometimes, i.e. there is occasionally a window when the launcher is able to resolve its own hostname but the workers are not resolvable yet. This usually happens to me on the second run if I schedule two runs in a row.

It should be fine though, since this is only an example.

@alculquicondor (Collaborator, Author) replied on Jul 29, 2021:

Are you sure that was the problem?
When I only had the check for the launcher, I was getting flaky startups. Now that I have checks for all the workers as well, the job starts every time. I couldn't really debug what was going on with just the check for the launcher, as Hydra doesn't log the output of the ssh calls :(
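
To make the all-workers check concrete, a hypothetical sketch reusing the waitResolvable helper sketched earlier (the real check lives in the shell entrypoint; the variables here are illustrative):

// Block until the launcher and every worker hostname resolves; only
// then start mpirun. launcherHost and workerHosts are assumed inputs.
hosts := append([]string{launcherHost}, workerHosts...)
for _, h := range hosts {
    if err := waitResolvable(h, 5*time.Minute); err != nil {
        log.Fatalf("giving up on %s: %v", h, err)
    }
}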

Commit: Adds the field .spec.mpiImplementation, defaults to OpenMPI

The Intel implementation requires a Service fronting the launcher.
@alculquicondor (Collaborator, Author):

Any other suggestions?

@gaocegege (Member):

LGTM 👍
/lgtm

@alculquicondor (Collaborator, Author):

/assign @terrytangyuan
for approval

@terrytangyuan (Member) left a comment:

Thanks!

/lgtm
/approve

@google-oss-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
