Skip to content

Commit

Permalink
Fix default label for Training Operators (#1808)
Browse files Browse the repository at this point in the history
* Fix default label for Training Operators

* Fix version comment

* Change the docs

* Change git command
  • Loading branch information
andreyvelich committed Feb 15, 2022
1 parent adfba2f commit 6a36763
Show file tree
Hide file tree
Showing 7 changed files with 9 additions and 8 deletions.
2 changes: 1 addition & 1 deletion docs/proposals/metrics-collector.md
Original file line number Diff line number Diff line change
Expand Up @@ -123,7 +123,7 @@ In the namespace with `katib.kubeflow.org/metrics-collector-injection=enabled` l

In **Pod Level Injecting**,

1. Job operators (_e.x. TFjob/PyTorchjob_) tag the `job-role: master` ([#1064](https://github.com/kubeflow/tf-operator/pull/1064)) label on the master pod.
1. Job operators (_e.x. TFjob/PyTorchjob_) tag the `training.kubeflow.org/job-role: master` ([#1064](https://github.com/kubeflow/tf-operator/pull/1064)) label on the master pod.
2. The webhook inject the metric collector only if the webhook recognizes this label.
3. The webhook uses [ObjectSelector](https://github.com/kubernetes/kubernetes/pull/78505) to skip on irrelevant objects in order to optimize the performance.
4. ObjectSelector is only supported above _Kubernetes v1.15_. Without this new feature, there may be a [performance issue](https://github.com/kubeflow/katib/issues/685#issuecomment-516226070) in webhook. In this situation, the following **Job Level Injecting** mode may be a better option.
Expand Down
2 changes: 1 addition & 1 deletion docs/proposals/trial-custom-crd.md
Original file line number Diff line number Diff line change
Expand Up @@ -124,7 +124,7 @@ For example, for TFJob:
```yaml
. . .
PrimaryPodLabel:
"job-role": "master"
"training.kubeflow.org/job-role": "master"
. . .
```

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ spec:
primaryContainerName: mxnet
# In this example we can collect metrics only from the Worker pods.
primaryPodLabels:
replica-type: worker
training.kubeflow.org/replica-type: worker
trialParameters:
- name: learningRate
description: Learning rate for the training model
Expand Down
3 changes: 2 additions & 1 deletion pkg/apis/controller/experiments/v1beta1/constants.go
Original file line number Diff line number Diff line change
Expand Up @@ -38,13 +38,14 @@ const (

var (
// DefaultKubeflowJobPrimaryPodLabels is the default value of spec.trialTemplate.primaryPodLabels for Kubeflow Training Job.
DefaultKubeflowJobPrimaryPodLabels = map[string]string{"job-role": "master"}
DefaultKubeflowJobPrimaryPodLabels = map[string]string{"training.kubeflow.org/job-role": "master"}

// KubeflowJobKinds is the list of Kubeflow Training Job kinds.
KubeflowJobKinds = map[string]bool{
"TFJob": true,
"PyTorchJob": true,
"XGBoostJob": true,
"MXJob": true,
"MPIJob": true,
}
)
4 changes: 2 additions & 2 deletions pkg/ui/v1beta1/frontend/src/reducers/general.js
Original file line number Diff line number Diff line change
Expand Up @@ -50,14 +50,14 @@ const initialState = {
value: 'status.conditions.#(type=="Complete")#|#(status=="True")#',
description: `Condition when Trial custom resource is succeeded.
Default value for k8s BatchJob: status.conditions.#(type=="Complete")#|#(status=="True")#.
Default value for Kubeflow Job (TFJob, PyTorchJob, XGBoostJob, MXJob): status.conditions.#(type=="Succeeded")#|#(status=="True")#.`,
Default value for Kubeflow Job (TFJob, PyTorchJob, XGBoostJob, MXJob, MPIJob): status.conditions.#(type=="Succeeded")#|#(status=="True")#.`,
},
{
name: 'FailureCondition',
value: 'status.conditions.#(type=="Failed")#|#(status=="True")#',
description: `Condition when Trial custom resource is failed.
Default value for k8s BatchJob: status.conditions.#(type=="Failed")#|#(status=="True")#.
Default value for Kubeflow Job (TFJob, PyTorchJob, XGBoostJob, MXJob): status.conditions.#(type=="Failed")#|#(status=="True")#.`,
Default value for Kubeflow Job (TFJob, PyTorchJob, XGBoostJob, MXJob, MPIJob): status.conditions.#(type=="Failed")#|#(status=="True")#.`,
},
{
name: 'Retain',
Expand Down
2 changes: 1 addition & 1 deletion test/e2e/v1beta1/argo_workflow.py
Original file line number Diff line number Diff line change
Expand Up @@ -167,7 +167,7 @@ def create_task_template(self, task_name, exec_image, command):
},
{
"name": "EXTRA_REPOS",
"value": "kubeflow/testing@HEAD;kubeflow/manifests@v1.4-branch"
"value": "kubeflow/testing@HEAD;kubeflow/manifests@v1.5-branch"
},
# Set GOPATH to test_dir because Katib repo is located under /src/github.com/kubeflow/katib
{
Expand Down
2 changes: 1 addition & 1 deletion test/e2e/v1beta1/scripts/setup-katib.sh
Original file line number Diff line number Diff line change
Expand Up @@ -41,8 +41,8 @@ cat "manifests/v1beta1/components/controller/katib-config.yaml"
echo "Creating Kubeflow namespace"
kubectl create namespace kubeflow

echo "Deploying training-operator from kubeflow/manifests v1.4 branch"
cd "${MANIFESTS_DIR}/apps/training-operator/upstream/overlays/kubeflow"
echo "Deploying Training Operator from kubeflow/manifests $(git rev-parse --abbrev-ref HEAD)"
kustomize build . | kubectl apply -f -

echo "Deploying Katib"
Expand Down

0 comments on commit 6a36763

Please sign in to comment.