Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[openmpi]- NodeSelector not working. #1230

Closed
karthikvadla opened this issue Jul 18, 2018 · 8 comments
Closed

[openmpi]- NodeSelector not working. #1230

karthikvadla opened this issue Jul 18, 2018 · 8 comments
Assignees
Labels

Comments

@karthikvadla
Copy link

As mentioned here in GPU Training section trying to use --nodeSelector to specify pool of nodes to launch my workloads.

But --nodeSelector is not working as expected.

--nodeSelector=<nodeSelector> Comma-delimited list of "key=value" pairs to select the worker nodes. e.g. "cloud.google.com/gke-accelerator=nvidia-tesla-k80" [default: null, type: string]

Above description is bit confusing, mentioned type as string but when explaining written as comma -delimited list (Typo need to be corrected)

Tried as below.

if type is array
NODE_SELECTOR='["kubernetes.io/hostname=gke-x1", "kubernetes.io/hostname=gke-x2"]' - It didn't honor, randomly picked up some nodes.

or
if type is string
NODE_SELECTOR="kubernetes.io/hostname=gke-x1, gke-x2". It gave me below error

level=error msg="find objects: RUNTIME ERROR: Index 1 out of bounds, not within [0, 1)\n\t/Users/kvadla/git-repo/mlt/examples/horovod_unet/mlt_app/unet-horovod-2e6fbba6-7e39-41f9-98b5-1f3241001e33/vendor/kubeflow/openmpi/util.libsonnet:14:46-81\tobject \n\t:1222:30-34\tthunk from <thunk from <function >>\n\t:1216:24-25\tthunk from <thunk from <function >>\n\t\tbuiltin function \n\t:1217:8-9\tfunction \n\t:1165:25-26\tthunk from <thunk from <function >>\n\t\tbuiltin function \n\t:1167:29-31\tthunk from <function >\n\t\tbuiltin function \n\t:1167:8-36\tfunction \n\t\t\n\t:1222:20-35\tthunk from <function >\n\t:1208:10-11\tfunction \n\t:1165:25-26\tthunk from <thunk from <function >>\n\t\tbuiltin function \n\t:1167:29-31\tthunk from <function >\n\t\tbuiltin function \n\t:1167:8-36\tfunction \n\t\t\n\t:1222:10-36\tfunction \n\t...\n\t\tbuiltin function \n\t:1167:29-31\tthunk from <function >\n\t\tbuiltin function \n\t:1167:8-36\tfunction \n\t\t\n\t:1218:35-56\tfunction \n\t\tbuiltin function \n\t:1222:20-35\tthunk from <function >\n\t:1208:10-11\tfunction \n\t:1165:25-26\tthunk from <thunk from <function >>\n\t\tbuiltin function \n\t:1167:29-31\tthunk from <function >\n\t\tbuiltin function \n\t:1167:8-36\tfunction \n\t\t\n\t:1222:10-36\tfunction \n\t\tbuiltin function \n\t\tbuiltin function <$objectFlatMerge>\n\textvar:__ksonnet/components:7:3-58\tobject \n\tDuring manifestation\t\n"
make: *** [deploy] Error 1

Found out during discussion #838
@everpeace

@karthikvadla
Copy link
Author

karthikvadla commented Jul 18, 2018

/area openmpi
/priority p1

@everpeace
Copy link
Contributor

/cc @jiezhang
Would you mind taking a look at this??

@jiezhang
Copy link

Sorry about the confusion. The syntax is comma separated list of key=value pairs, e.g. "kubernetes.io/hostname=gke-x1,cloud.google.com/gke-accelerator=nvidia-tesla-k80". The conditions are AND'ed together.

Unfortunately we don't support IN operator in the expression at the moment. We need to figure out how to support IN operator here: https://github.com/kubeflow/kubeflow/blob/master/kubeflow/openmpi/util.libsonnet

One workaround I can think of is to label your nodes such that gke-x1 and gke-x1 share one common lable and use that as NODE_SELECTOR (e.g. "node-type=gke").

@jiezhang
Copy link

If you're using gcloud, you can specify node labels this way:

gcloud container node-pools create ... --node-labels=node-type=gke

@karthikvadla
Copy link
Author

Oh yeah my nodes are already created. Will try to update lables for those nodes and use it.
Thanks @jiezhang

@jiezhang
Copy link

/assign @jiezhang

@ashahba
Copy link
Member

ashahba commented Jul 18, 2018

The nodes are labeled now and we have:

$ kubectl get nodes --selector=node-type=highmem
NAME                                            STATUS    ROLES     AGE       VERSION
gke-xxx-xx-xxxxxxxx-xxxx   Ready     <none>    53d       v1.10.2-gke.3
gke-yyy-yy-yyyyyyyy-yyyy   Ready     <none>    53d       v1.10.2-gke.3
gke-zzz-zz-zzzzzzzz-zzzz   Ready     <none>    53d       v1.10.2-gke.3

@jiezhang
Copy link

Looks like NodeSelector does not support IN operator. We need to support Affinity if you want "hostname IN (gke-x1, gke-x2)".

I'm closing this for now. Please open a feature request if you need the feature.

/close

yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
surajkota pushed a commit to surajkota/kubeflow that referenced this issue Jun 13, 2022
* Fix a bunch issues with GCP blueprints for private gke.

* Tracking issue GoogleCloudPlatform/kubeflow-distribution#33

* Fix the setters on firewall rules. They should be partial setters so
  we don't lose the suffixes.

* Add a firewall rule to allow cert-manager webhooks this is necessary
  to work with private GKE

  ref https://docs.cert-manager.io/en/release-0.11/getting-started/webhook.html#running-on-private-gke-clusters

* Add kpt/kustomize function to configure the transform to replace
  images with the mirror'd image versions.

* Update image mirroring configs

  * Instead of using "*" to match all images we list out image prefixes
    to match so we are a bit more intentional.

  * We want to include gcr.io images in order to support working with
    VPC-SC. For VPC-SC gcr.io images need to be mirror'd as
    well because they are unlikely to be within the perimeter

  * Use the locations gcr.io/${PROJECT}/mirror
    It looks like the mirror'ing pipeline includes the registry name

* Change the release channel on the cluster to be upper case

  * Per GoogleCloudPlatform/k8s-config-connector#194
we need release channels to be upper case otherwise updates fail.

* centraldashboard  v3 kustomization.yaml needs an image stanza
  * Without this we end up deploying using tag "latest" which isn't
  what we want.

* Use CNRM to enable services GoogleCloudPlatform/kubeflow-distribution#31

* Remove cert-manager ACME challenge from excluded paths for JWT
  validation
  * We no longer use cert-manager so we no longer need to allow that
    path.

* We need to add a default network route in order to allow cloudnat to
  access the outbound interet access
  * Need to access jwks

* Give routes and nat resources unique names based on the KF name.

* Route to public internet should be higher priority so google apis take precedence.

* * Regenerate tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants