Support kubeflow.org/xgboostjob #1114

Merged
merged 2 commits into from Sep 15, 2023

Conversation

Member

@tenzen-y tenzen-y commented Sep 12, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Support kubeflow.org/xgboostjob.

Which issue(s) this PR fixes:

Part-of #297

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Support kubeflow.org/xgboostjob

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/documentation Categorizes issue or PR as related to documentation. labels Sep 12, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

@netlify

netlify bot commented Sep 12, 2023

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit e4fceeb
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/650190d48e18ef0008f89e97
😎 Deploy Preview https://deploy-preview-1114--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 12, 2023
@tenzen-y
Member Author

/hold for review

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 12, 2023
Contributor

@mimowo mimowo left a comment

LGTM, have you tested this on a running cluster?
/assign @alculquicondor

return len(createdWorkload.Status.Conditions) == 2
}, util.ConsistentDuration, util.Interval).Should(gomega.BeTrue())

ginkgo.By("checking the job gets suspended when parallelism changes and the added node selectors are removed")
Contributor

non-blocking nit: we could make the integration test shorter as the implementation is pretty well commonized. I'm thinking of this scenario as a potential overkill, but I don't have a strong view, as some integration tests are needed and a scenario like this probably takes <1s anyway.

Member Author

That's a good point.

> we could make the integration test shorter as the implementation is pretty well commonized. I'm thinking of this scenario as a potential overkill

Does that refer to all integrations, or only to the kubeflow jobs?

Contributor

@mimowo mimowo Sep 13, 2023

We have pretty well commonized code at both levels - between integrations, and kubeflow integrations in particular. As a result each kubeflow integration implements only around 10 functions, and all are one-liners.
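
As a rough illustration of what "commonized" means here, the sketch below keeps shared logic in helpers written against a small interface, so that a per-kind type only supplies one-line accessors. All names (`kubeflowJob`, `replicaSpec`, `totalPods`, `suspend`) are hypothetical and chosen for this example; they are not claimed to match Kueue's actual adapter interface.

```go
package main

import "fmt"

// replicaSpec is a hypothetical, simplified stand-in for one replica type
// (for example Master or Worker) of a Kubeflow-style job.
type replicaSpec struct {
	Name     string
	Replicas int32
}

// kubeflowJob is a hypothetical interface for this sketch only. The method
// names are illustrative and are not Kueue's real API.
type kubeflowJob interface {
	Kind() string
	ReplicaSpecs() []replicaSpec
	IsSuspended() bool
	SetSuspended(bool)
}

// xgboostJob is a toy job type. Every method is a one-liner because the
// shared behavior lives in the helpers below.
type xgboostJob struct {
	specs     []replicaSpec
	suspended bool
}

func (j *xgboostJob) Kind() string                { return "XGBoostJob" }
func (j *xgboostJob) ReplicaSpecs() []replicaSpec { return j.specs }
func (j *xgboostJob) IsSuspended() bool           { return j.suspended }
func (j *xgboostJob) SetSuspended(s bool)         { j.suspended = s }

// totalPods is shared logic written once against the interface: it sums the
// replica counts across all replica types.
func totalPods(j kubeflowJob) int32 {
	var n int32
	for _, s := range j.ReplicaSpecs() {
		n += s.Replicas
	}
	return n
}

// suspend is shared logic as well. It marks the job suspended and reports
// whether the call changed anything.
func suspend(j kubeflowJob) bool {
	if j.IsSuspended() {
		return false
	}
	j.SetSuspended(true)
	return true
}

func main() {
	job := &xgboostJob{specs: []replicaSpec{{"Master", 1}, {"Worker", 2}}}
	fmt.Println(job.Kind(), totalPods(job), suspend(job), job.IsSuspended())
}
```

With this shape, adding another job kind would mean writing a new small type with the same accessors, while `totalPods` and `suspend` stay untouched.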

Contributor

> That's a good point.
>
> > we could make the integration test shorter as the implementation is pretty well commonized. I'm thinking of this scenario as a potential overkill
>
> Does that refer to all integrations, or only to the kubeflow jobs?

I would focus only on kubeflow jobs here.

Member Author

That makes sense. I think moving this case to a unit test might be better: the scenario is still useful for checking the expected behavior, even though it is overkill as an integration test.
WDYT?
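
A minimal sketch of how such a scenario might look as a table-driven check instead of an integration test. It assumes a deliberately simplified rule: a job must be suspended again when its replica counts no longer match the counts recorded at admission. The type `podSetCounts` and the function `needsSuspend` are hypothetical names for this example, and the node-selector part of the original scenario is left out.

```go
package main

import "fmt"

// podSetCounts maps a replica type name (for example "Master" or "Worker")
// to its pod count. Hypothetical type for this sketch.
type podSetCounts map[string]int32

// needsSuspend reports whether the current replica counts differ from the
// counts recorded at admission. This is a simplified stand-in for the
// behavior the integration test asserts, not the project's actual logic.
func needsSuspend(admitted, current podSetCounts) bool {
	if len(admitted) != len(current) {
		return true
	}
	for name, want := range admitted {
		if got, ok := current[name]; !ok || got != want {
			return true
		}
	}
	return false
}

func main() {
	admitted := podSetCounts{"Master": 1, "Worker": 2}

	// Each case changes one aspect of the job and states the expected result.
	cases := []struct {
		name    string
		current podSetCounts
		want    bool
	}{
		{"unchanged", podSetCounts{"Master": 1, "Worker": 2}, false},
		{"worker parallelism changed", podSetCounts{"Master": 1, "Worker": 3}, true},
		{"worker replica type removed", podSetCounts{"Master": 1}, true},
	}

	for _, tc := range cases {
		got := needsSuspend(admitted, tc.current)
		fmt.Printf("%s: got=%v want=%v\n", tc.name, got, tc.want)
	}
}
```

In a real unit test the loop body would call `t.Run` and compare `got` with `want`; the table layout is the part that carries over.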

Contributor

sgtm, still I would keep some basic integration tests per integration; that can be a follow-up

Member Author

Let's track this by creating an issue.

Contributor

sgtm, would you like to create one?

Member Author

Sorry, I forgot to push the "Submit new issue" button 😞
Created: #1119

@tenzen-y
Member Author

> LGTM, have you tested this on a running cluster?
> /assign @alculquicondor

Yes, I verified this feature on a real cluster using the following XGBoostJob:

apiVersion: kubeflow.org/v1
kind: XGBoostJob
metadata:
  name: xgboost-dist-iris-test-train
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  xgbReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: xgboost
            image: docker.io/merlintang/xgboost-dist-iris:1.1
            resources:
              requests:
                cpu: 0.5
                memory: 256Mi
            ports:
            - containerPort: 9991
              name: xgboostjob-port
            imagePullPolicy: Always
            args:
              - --job_type=Train
              - --xgboost_parameter=objective:multi:softprob,num_class:3
              - --n_estimators=10
              - --learning_rate=0.1
              - --model_path=/tmp/xgboost-model
              - --model_storage_type=local
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      template:
        spec:
          containers:
          - name: xgboost
            image: docker.io/merlintang/xgboost-dist-iris:1.1
            resources:
              requests:
                cpu: 0.5
                memory: 256Mi
            ports:
            - containerPort: 9991
              name: xgboostjob-port
            imagePullPolicy: Always
            args:
              - --job_type=Train
              - --xgboost_parameter="objective:multi:softprob,num_class:3"
              - --n_estimators=10
              - --learning_rate=0.1

I will create a PR for the documentation once #1109 (comment) is resolved.
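
For a sense of how much quota the XGBoostJob manifest in this comment asks for, the sketch below adds up its requests. It assumes usage is counted as replicas × per-pod request, summed over replica types; the `replicaRequest` type and `totals` function are hypothetical helpers for this example, and the numbers are copied from the manifest (0.5 CPU = 500m).

```go
package main

import "fmt"

// replicaRequest holds the per-pod requests of one replica type.
// Hypothetical helper type for this sketch.
type replicaRequest struct {
	Name      string
	Replicas  int64
	MilliCPU  int64 // per-pod CPU request in millicores (0.5 CPU = 500m)
	MemoryMiB int64 // per-pod memory request in MiB
}

// totals sums replicas × per-pod request over all replica types.
func totals(rs []replicaRequest) (milliCPU, memoryMiB int64) {
	for _, r := range rs {
		milliCPU += r.Replicas * r.MilliCPU
		memoryMiB += r.Replicas * r.MemoryMiB
	}
	return milliCPU, memoryMiB
}

func main() {
	// Values taken from the Master and Worker sections of the manifest.
	reqs := []replicaRequest{
		{Name: "Master", Replicas: 1, MilliCPU: 500, MemoryMiB: 256},
		{Name: "Worker", Replicas: 2, MilliCPU: 500, MemoryMiB: 256},
	}
	cpu, mem := totals(reqs)
	fmt.Printf("cpu=%dm memory=%dMi\n", cpu, mem) // prints "cpu=1500m memory=768Mi"
}
```

Under that assumption, the queue referenced by `kueue.x-k8s.io/queue-name: user-queue` would need at least 1.5 CPU and 768Mi of free quota for the job to be admitted.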

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2023
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y
Member Author

/remove-kind documentation
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/documentation Categorizes issue or PR as related to documentation. labels Sep 13, 2023
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2023
@tenzen-y
Member Author

/retest
Due to #1090

@tenzen-y
Member Author

#1090 happened again :(

/test pull-kueue-test-e2e-main-1-24

@mimowo
Contributor

mimowo commented Sep 14, 2023

/lgtm
/assign @trasc
in case @trasc would like to give a review pass as he reviewed the previous kubeflow integrations

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 14, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 915ed626e7bc079fe2a4b79283648146bf2fa7ce

Contributor

@trasc trasc left a comment

/lgtm

@tenzen-y
Member Author

Let's merge this.
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 15, 2023
@tenzen-y tenzen-y mentioned this pull request Sep 15, 2023
2 tasks
@k8s-ci-robot k8s-ci-robot merged commit 731938d into kubernetes-sigs:main Sep 15, 2023
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.5 milestone Sep 15, 2023
@tenzen-y tenzen-y deleted the add-support-xgboostjob branch September 15, 2023 08:52
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
5 participants