Custom CRD: Support dynamic Trial's jobs conditions #1307

Merged

@andreyvelich andreyvelich commented Aug 21, 2020

Related: #1214.
WIP until #1305 is merged. After that I will generate the API clients, etc., but it would be great if you could start reviewing now.

I changed the API a bit from the proposal to support various success and failure conditions in the CRD.
As far as I can see, not all CRDs use the same status approach with a Conditions list that Kubeflow operators and BatchJobs use.
For example, Argo uses .status.phase to define whether a Node has succeeded or failed.

To handle all of these cases, I propose an approach where the user defines SuccessCondition and FailureCondition in the Experiment.

Each condition is a gjson expression that points directly at the state in which the job has succeeded or failed.

For example, for TFJob:

successCondition = "status.conditions.#(type=="Succeeded")#|#(status=="True")#"
failureCondition = "status.conditions.#(type=="Failed")#|#(status=="True")#"

For Argo Workflow:

successCondition = "status.[@this].#(phase=="Succeeded")"
failureCondition = "status.[@this].#(phase=="Failed")"

I think this approach is extensible, since users can define any expression for when they want the Trial to fail.
For example, when a TFJob reaches 3 failed replicas:
status.replicaStatuses.master.[@this].#(failed=="3")

For Kubeflow operators and BatchJobs, we can define these conditions in advance if we don't want users to add them manually.

I renamed UpdateTrialStatusCondition to UpdateTrialStatusConditionDeprecated, since I had to refactor UpdateTrialStatusCondition.

In job_util.go I created TrialJobStatus, which represents the current Job status.

  1. When the Job is created or running, the status is Running.
  2. If the Job reaches successCondition, the status is Succeeded.
  3. If the Job reaches failureCondition, the status is Failed.

I also transfer Message and Reason from the deployed Job to the Trial where possible.
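
The three-state mapping above can be sketched roughly as follows. This is only an illustration, not the code in job_util.go: it uses the standard library instead of gjson and hand-codes the check that the TFJob expressions describe (a condition of type Succeeded or Failed whose status is "True"). The names statusType, jobStatusSketch, firstTrue and evaluateJob are hypothetical, and checking success before failure is an assumption.

```go
package main

import "encoding/json"

// statusType is a hypothetical stand-in for the three job states listed above.
type statusType string

const (
	statusRunning   statusType = "Running"
	statusSucceeded statusType = "Succeeded"
	statusFailed    statusType = "Failed"
)

// condition mirrors one entry of a job's status.conditions list.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// jobStatusSketch is an illustrative result type, not the PR's TrialJobStatus.
type jobStatusSketch struct {
	Condition statusType
	Reason    string
	Message   string
}

// firstTrue returns the first condition of the given type whose status is "True".
func firstTrue(conds []condition, condType string) (condition, bool) {
	for _, c := range conds {
		if c.Type == condType && c.Status == "True" {
			return c, true
		}
	}
	return condition{}, false
}

// evaluateJob maps a job's JSON to Running, Succeeded or Failed and copies
// Reason and Message from the matching condition when there is one.
// Assumption: success is checked before failure.
func evaluateJob(rawJob []byte) (jobStatusSketch, error) {
	var job struct {
		Status struct {
			Conditions []condition `json:"conditions"`
		} `json:"status"`
	}
	if err := json.Unmarshal(rawJob, &job); err != nil {
		return jobStatusSketch{}, err
	}
	if c, ok := firstTrue(job.Status.Conditions, "Succeeded"); ok {
		return jobStatusSketch{Condition: statusSucceeded, Reason: c.Reason, Message: c.Message}, nil
	}
	if c, ok := firstTrue(job.Status.Conditions, "Failed"); ok {
		return jobStatusSketch{Condition: statusFailed, Reason: c.Reason, Message: c.Message}, nil
	}
	// Neither condition matched: the job is treated as still running.
	return jobStatusSketch{Condition: statusRunning}, nil
}
```

Calling evaluateJob with the JSON of a deployed job returns Running until one of the two conditions has status "True". With user-supplied gjson expressions, the two hard-coded checks would be replaced by evaluating successCondition and failureCondition against the same JSON.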

/assign @gaocegege @johnugeorge @sperlingxx
/cc @czheng94 @nielsmeima

@k8s-ci-robot

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: czheng94, nielsmeima.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

@andreyvelich
Member Author

@gaocegege @johnugeorge @sperlingxx This PR is ready for review.

@andreyvelich andreyvelich changed the title [WIP] Custom CRD: Support dynamic Trial's jobs conditions Custom CRD: Support dynamic Trial's jobs conditions Sep 3, 2020
@andreyvelich
Member Author

/retest

@gaocegege gaocegege left a comment

/assign @johnugeorge

}

// TODO (andreyvelich): Can be deleted after custom CRD is implemented
func (r *ReconcileTrial) UpdateTrialStatusConditionDeprecated(instance *trialsv1beta1.Trial, deployedJob *unstructured.Unstructured, jobCondition *commonv1.JobCondition) {
Member

Can we delete it now?

Member Author

@gaocegege We can delete it after we finish Custom CRD implementation.
The controller still runs this function if successCondition and failureCondition are not set, here:

r.UpdateTrialStatusConditionDeprecated(instance, deployedJob, jobCondition)

@gaocegege gaocegege left a comment

/lgtm

@johnugeorge
Member

Please keep track of the cleanup required to remove the deprecated functions.

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

@johnugeorge
Member

/lgtm

@andreyvelich
Member Author

/retest

@k8s-ci-robot k8s-ci-robot merged commit 7b797e1 into kubeflow:master Sep 8, 2020
@andreyvelich andreyvelich deleted the issue-1214-custom-conditions branch October 2, 2021 23:42