Implement webhook validations for the PyTorchJob #2035

Merged
merged 1 commit into from
Apr 10, 2024


tenzen-y
Member

@tenzen-y tenzen-y commented Mar 26, 2024

What this PR does / why we need it:
I implemented webhook validations for the PyTorchJob.
I didn't add any new validations; the training-operator has the same validations as before.

  1. Implement the cert-generation mechanism using the open-policy-agent/cert-controller.
  2. Move existing PyTorchJob validations to the webhook validations and adjust validation results to the webhook form.
  3. Add some manifests for RBAC and ValidatingWebhookConfiguration.
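
For readers unfamiliar with step 2, the following is a minimal, standard-library-only sketch of what "validation results in the webhook form" can look like: every problem is collected and reported together instead of failing on the first one. The types `Container` and `ReplicaSpec`, the function `validateReplicaSpecs`, the field paths, and the `pytorch` container name are all assumptions made for illustration; this is not the code added by the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// Container and ReplicaSpec are hypothetical, trimmed-down stand-ins for the
// Kubernetes and training-operator API types.
type Container struct {
	Name  string
	Image string
}

type ReplicaSpec struct {
	Containers []Container
}

// defaultContainerName is an assumed name for the main training container.
const defaultContainerName = "pytorch"

// validateReplicaSpecs returns every problem it finds rather than stopping at
// the first one, so a webhook could report them all in a single response.
func validateReplicaSpecs(specs map[string]ReplicaSpec) []string {
	if len(specs) == 0 {
		return []string{"spec.pytorchReplicaSpecs: must be specified"}
	}

	// Visit replica types in sorted order so the output is deterministic.
	rTypes := make([]string, 0, len(specs))
	for rType := range specs {
		rTypes = append(rTypes, rType)
	}
	sort.Strings(rTypes)

	var errs []string
	for _, rType := range rTypes {
		path := fmt.Sprintf("spec.pytorchReplicaSpecs[%s].containers", rType)
		containers := specs[rType].Containers
		if len(containers) == 0 {
			errs = append(errs, path+": must contain at least one container")
			continue
		}
		hasDefault := false
		for i, c := range containers {
			if c.Image == "" {
				errs = append(errs, fmt.Sprintf("%s[%d].image: must be set", path, i))
			}
			if c.Name == defaultContainerName {
				hasDefault = true
			}
		}
		if !hasDefault {
			errs = append(errs, fmt.Sprintf("%s: no container named %q", path, defaultContainerName))
		}
	}
	return errs
}

func main() {
	// Hypothetical job: the Worker replica has no image and no default container.
	job := map[string]ReplicaSpec{
		"Master": {Containers: []Container{{Name: "pytorch", Image: "example.com/train:latest"}}},
		"Worker": {Containers: []Container{{Name: "trainer"}}},
	}
	for _, e := range validateReplicaSpecs(job) {
		fmt.Println(e)
	}
}
```

In a real admission webhook the collected errors would be returned to the API server so the request is rejected before the object is persisted, rather than being discovered later by the controller.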

Note that, as a first iteration, this PR doesn't support cert-manager. Maybe we can support it in the release after the next one.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #1993

Checklist:

  • Docs included if any changes are user facing


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y tenzen-y force-pushed the implemenet-webhook-validations branch 2 times, most recently from 2396aeb to 56c3669 Compare March 26, 2024 12:39
@coveralls

coveralls commented Mar 26, 2024

Pull Request Test Coverage Report for Build 8634700799

Details

  • 56 of 174 (32.18%) changed or added relevant lines in 4 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 35.178%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/webhooks/webhooks.go | 0 | 3 | 0.0% |
| pkg/webhooks/pytorch/pytorchjob_webhook.go | 56 | 83 | 67.47% |
| pkg/cert/cert.go | 0 | 31 | 0.0% |
| cmd/training-operator.v1/main.go | 0 | 57 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| cmd/training-operator.v1/main.go | 1 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 8585006097: | -0.2% |
| Covered Lines: | 4335 |
| Relevant Lines: | 12323 |

💛 - Coveralls

@tenzen-y tenzen-y force-pushed the implemenet-webhook-validations branch 2 times, most recently from 5812cb8 to 91dbff2 Compare March 26, 2024 13:31
@tenzen-y tenzen-y force-pushed the implemenet-webhook-validations branch 4 times, most recently from 3ab0cb7 to ffddf1b Compare March 26, 2024 18:06
Comment on lines +15 to +35
trainingoperator.TFJobKind: scaffold,
trainingoperator.MXJobKind: scaffold,
trainingoperator.XGBoostJobKind: scaffold,
trainingoperator.MPIJobKind: scaffold,
trainingoperator.PaddleJobKind: scaffold,
Member Author

I'd prefer to implement these webhooks in other PRs so that we can keep this PR to the minimum required.
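
As a rough illustration of the pattern in the snippet above, here is a standard-library-only sketch of a registry that maps each job kind to a webhook setup function, with a no-op `scaffold` for kinds that are deferred to later PRs. `Manager`, `SetupFunc`, `supportedWebhooks`, `setupPyTorchWebhook`, and `setupAll` are hypothetical names, and the kind strings stand in for the `trainingoperator.*Kind` constants; the real code takes a controller-runtime manager instead.

```go
package main

import (
	"fmt"
	"sort"
)

// Manager is a hypothetical stand-in for the controller-runtime manager.
type Manager struct {
	registered []string
}

// SetupFunc registers the webhook for one job kind with the manager.
type SetupFunc func(m *Manager) error

// scaffold is a no-op placeholder for kinds whose webhooks are not implemented yet.
func scaffold(*Manager) error { return nil }

// setupPyTorchWebhook is a hypothetical setup function that records the registration.
func setupPyTorchWebhook(m *Manager) error {
	m.registered = append(m.registered, "PyTorchJob")
	return nil
}

// supportedWebhooks maps each job kind to its setup function. The string keys
// stand in for the trainingoperator.*Kind constants.
var supportedWebhooks = map[string]SetupFunc{
	"PyTorchJob": setupPyTorchWebhook,
	"TFJob":      scaffold,
	"MXJob":      scaffold,
	"XGBoostJob": scaffold,
	"MPIJob":     scaffold,
	"PaddleJob":  scaffold,
}

// setupAll calls every setup function in a deterministic order and stops at
// the first failure.
func setupAll(m *Manager) error {
	kinds := make([]string, 0, len(supportedWebhooks))
	for kind := range supportedWebhooks {
		kinds = append(kinds, kind)
	}
	sort.Strings(kinds)
	for _, kind := range kinds {
		if err := supportedWebhooks[kind](m); err != nil {
			return fmt.Errorf("setting up webhook for %s: %w", kind, err)
		}
	}
	return nil
}

func main() {
	m := &Manager{}
	if err := setupAll(m); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("webhooks registered:", m.registered)
}
```

With this shape, a follow-up PR for another kind only needs to swap its `scaffold` entry for a real setup function.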

@tenzen-y tenzen-y force-pushed the implemenet-webhook-validations branch from ffddf1b to e02439b Compare March 26, 2024 18:15
@tenzen-y
Member Author

/hold
for review.

@tenzen-y tenzen-y changed the title WIP: Implement webhook validations for the PyTorchJob Implement webhook validations for the PyTorchJob Mar 26, 2024
@tenzen-y
Member Author

/assign @andreyvelich @johnugeorge

@tenzen-y
Member Author

@andreyvelich @johnugeorge I addressed all comments. PTAL!

Comment on lines 1 to 4
apiVersion: v1
kind: Secret
metadata:
name: training-operator-webhook-secret
Member

Does it make sense to keep it in the Training Operator install, similar to Katib: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml#L38-L41 ?
In future releases, when we integrate with Cert Manager for the Kubeflow install, we can remove this secret generator from the Kubeflow Kustomize overlay.

Member Author

@tenzen-y tenzen-y Apr 10, 2024

Member

Yeah, I understand that. My suggestion is, instead of having a separate folder (/internalcert), to update the standalone and Kubeflow Kustomize installs to generate that secret as follows:

secretGenerator:
  - name: training-operator-webhook-secret
    options:
      disableNameSuffixHash: true

Member Author

Oh, I see.
That makes sense.

Member Author

Done.

@andreyvelich
Member

andreyvelich commented Apr 10, 2024

@tenzen-y Do you prefer to introduce Controller Runtime logger for webhook in separate PR based on our discussion here: #2035 (comment) ?

@tenzen-y
Member Author

@tenzen-y Do you prefer to introduce Controller Runtime logger for webhook in separate PR based on our discussion here: #2035 (comment) ?

@andreyvelich Yes, I do, since we need to remove many loggers for the replacement.

@tenzen-y
Member Author

@andreyvelich @johnugeorge I believe that this PR is ready for the merge.

Member

@andreyvelich andreyvelich left a comment

Thank you @tenzen-y!
Just a small comment from me.
/assign @johnugeorge

@@ -8,12 +7,16 @@ metadata:
prometheus.io/port: "8080"
labels:
app: training-operator
name: training-operator
name: training-operator-service
Member

@tenzen-y Do you need to move this service to the webhook directory and name it training-operator-service?
Why can't we just keep the service name as it is in /manifests/base/service.yaml and add another port, 9443?

Member

Member Author

Oh, this is a good catch!
At some point, we needed to move the Service to the webhook directory to apply the kustomize patch, but moving it is no longer needed.

Member Author

Done.

@tenzen-y tenzen-y force-pushed the implemenet-webhook-validations branch from a05ab1a to bff5b56 Compare April 10, 2024 15:37
@@ -8,12 +7,16 @@ metadata:
prometheus.io/port: "8080"
labels:
app: training-operator
name: training-operator
name: training-operator-service
Member

Do you need this renaming?

Member Author

Based on this discussion, we changed this name. So, would you prefer to keep using the training-operator?

I'm ok with either name.

Member

Oh, sorry, I think I made a mistake there, tbh we should keep this name as training-operator.

Member Author

No worries :)
I reverted this change.

@tenzen-y tenzen-y force-pushed the implemenet-webhook-validations branch 2 times, most recently from df00cc5 to e452469 Compare April 10, 2024 16:15
# TODO (tenzen-y): Once we support cert-manager, we need to remove this secret generation.
# REF: https://github.com/kubeflow/training-operator/issues/2049
secretGenerator:
- name: training-operator-webhook-secret
Member

For the secret name, would it be better to just name it training-operator-webhook-cert, similar to Katib, and not include the Kubernetes resource kind (e.g. secret) in the name?

Member Author

Oh, you already mentioned here: #2035 (comment)

Secret name: training-operator-webhook-cert

I'll try to refine this name.

Member Author

Done.

@johnugeorge
Member

Thanks @tenzen-y for this contribution
LGTM

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y tenzen-y force-pushed the implemenet-webhook-validations branch from e452469 to 3f15cfc Compare April 10, 2024 16:32
Member

@andreyvelich andreyvelich left a comment

Thanks a lot for working on this @tenzen-y!
/lgtm
/assign @kubeflow/wg-training-leads

@google-oss-prow google-oss-prow bot added the lgtm label Apr 10, 2024
@johnugeorge
Member

/hold cancel

@google-oss-prow google-oss-prow bot merged commit 7dd032d into master Apr 10, 2024
68 of 69 checks passed
@tenzen-y
Member Author

@andreyvelich @johnugeorge Thank you for the review!

@franciscojavierarceo
Contributor

@tenzen-y I posted this in Slack, but I think we need some instructions here for local development.

See #2069

johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>