[openmpi]- NodeSelector not working. #1230
Comments
/area openmpi
/cc @jiezhang
Sorry about the confusion. The syntax is a comma-separated list of key=value pairs, e.g. "kubernetes.io/hostname=gke-x1,cloud.google.com/gke-accelerator=nvidia-tesla-k80". The conditions are AND'ed together. Unfortunately, we don't support the IN operator in the expression at the moment. We need to figure out how to support the IN operator here: https://github.com/kubeflow/kubeflow/blob/master/kubeflow/openmpi/util.libsonnet One workaround I can think of is to label your nodes such that gke-x1 and gke-x2 share one common label, and use that as NODE_SELECTOR (e.g. "node-type=gke").
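The comma-separated syntax described above can be sketched in Python as follows. This is only an illustration, not the parsing code in util.libsonnet; the function name parse_node_selector and the choice to reject duplicate keys are assumptions made for this example. It shows why a value list such as "key=a, b" cannot express an IN condition: the result is a map with exactly one value per key, and every entry must match.

```python
def parse_node_selector(spec):
    """Turn 'k1=v1,k2=v2' into a dict shaped like a pod's nodeSelector map.

    Hypothetical helper for illustration; not part of the openmpi package.
    """
    selector = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        key, sep, value = pair.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {pair!r}")
        if key in selector:
            # A map holds one value per key, so "hostname=a,hostname=b"
            # cannot mean "a OR b". This sketch rejects it outright.
            raise ValueError(f"duplicate key {key!r}")
        selector[key] = value
    return selector


selector = parse_node_selector(
    "kubernetes.io/hostname=gke-x1,cloud.google.com/gke-accelerator=nvidia-tesla-k80"
)
```

With this sketch, the string "kubernetes.io/hostname=gke-x1, gke-x2" is rejected because the second element has no key.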
If you're using gcloud, you can specify node labels this way:
Oh yeah, my nodes are already created. I will try to update the labels for those nodes and use that.
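For nodes that already exist, a shared label can be attached with kubectl label. The sketch below is a minimal Python illustration that only builds the commands and does not run them. The helper name label_commands is hypothetical, and the node names and the node-type=gke label are taken from the examples in this thread.

```python
def label_commands(nodes, label):
    """Build one 'kubectl label nodes <node> <key>=<value>' command per node.

    Hypothetical helper; it returns argv lists and executes nothing.
    """
    if "=" not in label:
        raise ValueError(f"expected key=value, got {label!r}")
    return [["kubectl", "label", "nodes", node, label] for node in nodes]


commands = label_commands(["gke-x1", "gke-x2"], "node-type=gke")
for argv in commands:
    print(" ".join(argv))
```

Once both nodes carry the same label, NODE_SELECTOR="node-type=gke" selects either of them.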
/assign @jiezhang
The nodes are labeled now and we have:
It looks like NodeSelector does not support the IN operator. We would need to support Affinity if you want "hostname IN (gke-x1, gke-x2)". I'm closing this for now. Please open a feature request if you need the feature. /close
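For reference, the "hostname IN (gke-x1, gke-x2)" condition maps onto the node affinity field of a Kubernetes pod spec. The Python sketch below builds that structure as a plain dict. The helper name node_in_affinity is hypothetical, and per this thread the openmpi package does not expose such an option, so this shows only the shape of the pod spec, not a supported parameter.

```python
import json


def node_in_affinity(key, values):
    """Build a pod-spec 'affinity' value that requires `key` to be IN `values`.

    Hypothetical helper for illustration; field names follow the Kubernetes
    pod spec (nodeAffinity with a required matchExpressions term).
    """
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": key, "operator": "In", "values": list(values)}
                        ]
                    }
                ]
            }
        }
    }


affinity = node_in_affinity("kubernetes.io/hostname", ["gke-x1", "gke-x2"])
print(json.dumps({"affinity": affinity}, indent=2))
```

Unlike nodeSelector, a matchExpressions entry carries a list of values, which is what makes the IN semantics expressible.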
* Fix a bunch issues with GCP blueprints for private gke.
  * Tracking issue GoogleCloudPlatform/kubeflow-distribution#33
* Fix the setters on firewall rules. They should be partial setters so we don't lose the suffixes.
* Add a firewall rule to allow cert-manager webhooks this is necessary to work with private GKE ref https://docs.cert-manager.io/en/release-0.11/getting-started/webhook.html#running-on-private-gke-clusters
* Add kpt/kustomize function to configure the transform to replace images with the mirror'd image versions.
* Update image mirroring configs
  * Instead of using "*" to match all images we list out image prefixes to match so we are a bit more intentional.
  * We want to include gcr.io images in order to support working with VPC-SC. For VPC-SC gcr.io images need to be mirror'd as well because they are unlikely to be within the perimeter
  * Use the locations gcr.io/${PROJECT}/mirror It looks like the mirror'ing pipeline includes the registry name
* Change the release channel on the cluster to be upper case
  * Per GoogleCloudPlatform/k8s-config-connector#194 we need release channels to be upper case otherwise updates fail.
* centraldashboard v3 kustomization.yaml needs an image stanza
  * Without this we end up deploying using tag "latest" which isn't what we want.
* Use CNRM to enable services GoogleCloudPlatform/kubeflow-distribution#31
* Remove cert-manager ACME challenge from excluded paths for JWT validation
  * We no longer use cert-manager so we no longer need to allow that path.
* We need to add a default network route in order to allow cloudnat to access the outbound internet access
  * Need to access jwks
* Give routes and nat resources unique names based on the KF name.
* Route to public internet should be higher priority so google apis take precedence.
* Regenerate tests.
As mentioned in the GPU Training section, I am trying to use --nodeSelector to specify the pool of nodes on which to launch my workloads, but --nodeSelector is not working as expected.
--nodeSelector=<nodeSelector> Comma-delimited list of "key=value" pairs to select the worker nodes. e.g. "cloud.google.com/gke-accelerator=nvidia-tesla-k80" [default: null, type: string]
The description above is a bit confusing: the type is given as string, but the explanation describes a comma-delimited list (this needs to be corrected). I tried the following:
- If the type is array: NODE_SELECTOR='["kubernetes.io/hostname=gke-x1", "kubernetes.io/hostname=gke-x2"]'. It was not honored; the pods landed on nodes picked at random.
- If the type is string: NODE_SELECTOR="kubernetes.io/hostname=gke-x1, gke-x2". It gave me the error below.
Found out during discussion #838
@everpeace