Extend metrics with the new labels #113324

Merged
merged 2 commits into from Oct 31, 2022

mimowo
Contributor

@mimowo mimowo commented Oct 25, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

It extends the job metrics:

  • job_finished_total - gains a new label, reason, which takes one of the values "BackoffLimitExceeded", "DeadlineExceeded", "PodFailurePolicy", or "". With the new label, the metric counts failed jobs broken down by failure reason. The reason label is left empty for successful jobs.
  • pod_failures_handled_by_pod_failure_policy_total - a new counter metric with an action label, which takes one of the values "FailJob", "Ignore", or "Count". It counts the pod failures handled by each pod failure policy action.
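
To illustrate how the two metrics above are keyed by their labels, here is a minimal sketch using only the Go standard library. It is not the code from this PR: `counterVec`, `newCounterVec`, `Inc`, and `Get` are hypothetical helpers standing in for a real metrics library, and the `result` label with values "succeeded" and "failed" is an assumption made for the example. Only the `reason` and `action` label values come from the description above.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// counterVec is a hypothetical, stdlib-only stand-in for a labelled counter.
// Each distinct combination of label values is tracked as its own series.
type counterVec struct {
	mu     sync.Mutex
	labels []string
	values map[string]int
}

// newCounterVec creates a counter with the given label names.
func newCounterVec(labels ...string) *counterVec {
	return &counterVec{labels: labels, values: map[string]int{}}
}

// key joins label values with a separator that cannot appear in the values used here.
func key(labelValues []string) string {
	return strings.Join(labelValues, "\x00")
}

// Inc adds one to the series identified by the given label values.
func (c *counterVec) Inc(labelValues ...string) {
	if len(labelValues) != len(c.labels) {
		panic("label count mismatch")
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[key(labelValues)]++
}

// Get returns the current value of the series, or 0 if it was never incremented.
func (c *counterVec) Get(labelValues ...string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[key(labelValues)]
}

func main() {
	// Assumed label sets: ("result", "reason") and ("action").
	jobFinished := newCounterVec("result", "reason")
	podFailures := newCounterVec("action")

	jobFinished.Inc("succeeded", "") // reason stays empty for successful jobs
	jobFinished.Inc("failed", "BackoffLimitExceeded")
	jobFinished.Inc("failed", "PodFailurePolicy")

	podFailures.Inc("FailJob")
	podFailures.Inc("Ignore")
	podFailures.Inc("Ignore")

	fmt.Println(jobFinished.Get("failed", "PodFailurePolicy"))
	fmt.Println(podFailures.Get("Ignore"))
}
```

Keeping the reason as a label rather than a separate metric means a single query can either sum over all finished jobs or break failures down by cause.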

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Extend the `job_finished_total` job metric with a new `reason` label, and introduce a new job metric that counts pod failures
handled by the pod failure policy, broken down by the action applied.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 25, 2022
@k8s-ci-robot
Contributor

@mimowo: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Oct 25, 2022
@mimowo mimowo force-pushed the handling-pod-failures-metrics branch from c6f96bc to 2e07ae2 Compare October 25, 2022 12:31
@k8s-ci-robot k8s-ci-robot added area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 25, 2022
@mimowo
Contributor Author

mimowo commented Oct 25, 2022

/sig apps

@mimowo
Contributor Author

mimowo commented Oct 25, 2022

/assign @alculquicondor

@mimowo mimowo changed the title Extend metrics with the new fields Extend metrics with the new labels Oct 25, 2022
@mimowo mimowo force-pushed the handling-pod-failures-metrics branch 2 times, most recently from 3cf440a to 1f1cb97 Compare October 26, 2022 08:38
@mimowo
Contributor Author

mimowo commented Oct 26, 2022

/retest

@@ -84,6 +86,18 @@ var (
},
Member

update the help text above to indicate that pods that fit the "ignore" rule are not counted.

pkg/controller/job/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/controller/job/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/controller/job/job_controller.go Outdated Show resolved Hide resolved
@alculquicondor
Member

LGTM, but I'd better give it another pass after KubeCon.

Member

@alculquicondor alculquicondor left a comment

/approve
/hold
for suggestions

Comment on lines 723 to 718
if tc.wantAction == nil {
	if action != nil {
		t.Errorf("Unexpected job failure policy action. Got: %q", *action)
	}
} else {
	if action == nil {
		t.Errorf("Missing job failure policy action. want: %q", *tc.wantAction)
	} else if *tc.wantAction != *action {
		// Dereference both pointers so the message shows values, not addresses.
		t.Errorf("Unexpected job failure policy action. want: %q. got: %q", *tc.wantAction, *action)
	}
}
Member

you can simplify all of this with cmp.Equal or cmp.Diff

Contributor Author

Indeed, used cmp.Diff for this and the check for the failure message
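
For readers following the thread, the simplification suggested above amounts to collapsing the nested nil checks into a single comparison that returns a human-readable difference. The PR itself used cmp.Diff; the sketch below is a standard-library stand-in so that it runs without external dependencies. The `action` type and the `describe` and `diffAction` helpers are hypothetical names introduced for illustration only.

```go
package main

import "fmt"

// action is a hypothetical stand-in for the pod failure policy action type.
type action string

// describe renders an optional action for diagnostics without dereferencing nil.
func describe(a *action) string {
	if a == nil {
		return "<nil>"
	}
	return fmt.Sprintf("%q", *a)
}

// diffAction returns "" when want and got agree (both nil, or both pointing
// at equal values); otherwise it returns a description of the mismatch.
func diffAction(want, got *action) string {
	if want == nil && got == nil {
		return ""
	}
	if want != nil && got != nil && *want == *got {
		return ""
	}
	return fmt.Sprintf("want: %s, got: %s", describe(want), describe(got))
}

func main() {
	failJob, ignore := action("FailJob"), action("Ignore")

	fmt.Println(diffAction(nil, nil) == "")           // both unset: no difference
	fmt.Println(diffAction(&failJob, &failJob) == "") // same value: no difference
	fmt.Println(diffAction(&failJob, &ignore))        // differing values
	fmt.Println(diffAction(nil, &ignore))             // unexpected action
}
```

In a table-driven test this reduces the check to one branch: compute the difference for `tc.wantAction` and `action`, and report it with `t.Errorf` only when it is non-empty.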

job *batchv1.Job
wantJobFinishedNumMetric metricLabelsWithValue
wantJobPodsFinishedMetric metricLabelsWithValue
job *batchv1.Job
Member

I don't like the direction that TestMetrics is going. Instead, we could test metrics per feature in their corresponding tests.

For example, for the metric you are adding, you could introduce the check in TestJobPodFailurePolicy.

If it's too late to change it, we can merge as-is and fix it before the test freeze.

Contributor Author

Agreed. I have extracted the checks against the changed metrics into dedicated tests.
Also, to keep TestMetrics from becoming overloaded in the future, I suggest renaming it to TestMetricsOnSuccesses. PTAL.

}
}

if tc.wantJobFailure {
Member

This is already tested in other tests.

Contributor Author

The purpose of this field was to select the appropriate waiting method rather than to do a check. Anyway, the field is gone after refactoring to extract the checks to dedicated tests.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 28, 2022
@mimowo mimowo force-pushed the handling-pod-failures-metrics branch from 313b992 to 311b3dc Compare October 31, 2022 08:25
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 31, 2022
@mimowo mimowo force-pushed the handling-pod-failures-metrics branch from 311b3dc to fb119f3 Compare October 31, 2022 08:36
}
job_index := 0 // job index to avoid collisions between job names created by different test cases
for name, tc := range testCases {
for _, wFinalizers := range []bool{false, true} {
Member

Just test with true, we no longer care about non-finalizers

Contributor Author

Done

@@ -149,6 +149,147 @@ func TestMetrics(t *testing.T) {
}
}

func TestJobFinishedNumReasonMetric(t *testing.T) {
Member

can you cover this in TestJobPodFailurePolicy?

Contributor Author

Technically I can, but it seems reasonable to keep them decoupled, much as we decided to split TestMetrics into smaller pieces. For example, TestJobFinishedNumReasonMetric has a scenario "non-indexed job; failed" that expects the BackoffLimitExceeded reason. That scenario does not seem to belong conceptually in TestJobPodFailurePolicy.

@mimowo mimowo force-pushed the handling-pod-failures-metrics branch from fb119f3 to d7bd920 Compare October 31, 2022 14:03
@alculquicondor
Member

/lgtm
/approve
/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 31, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mimowo
Contributor Author

mimowo commented Oct 31, 2022

@alculquicondor please unhold

@alculquicondor
Member

Feel free to unhold when the hold was due to a nit
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2022
@k8s-ci-robot k8s-ci-robot merged commit 3628532 into kubernetes:master Oct 31, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Oct 31, 2022
jaehnri pushed a commit to jaehnri/kubernetes that referenced this pull request Jan 3, 2023
* Extend job metrics

* Refactor TestMetrics to extract its checks into dedicated tests per feature
@mimowo mimowo deleted the handling-pod-failures-metrics branch March 18, 2023 18:43