Extend metrics with the new labels #113324

mimowo · 2022-10-25T12:31:18Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

It extends the job metrics:

- job_finished_total: extended with a new label, reason, which takes one of the values "BackoffLimitExceeded", "DeadlineExceeded", "PodFailurePolicy", or "". With the new label, the metric counts the number of jobs that failed for a given reason. The reason label is left empty for successful jobs.
- pod_failures_handled_by_pod_failure_policy_total: a new counter metric with an action label, which takes one of the values "FailJob", "Ignore", or "Count". The metric counts the number of pod failures handled by a given pod failure policy action.
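
As a rough illustration of how the two labeled counters described above behave, here is a minimal standard-library sketch. It does not use the real metrics framework from the job controller. The `counterVec` type, the `finishedKey` struct, and the "succeeded"/"failed" result values are assumptions made for this example only; the reason and action strings are the ones listed in the description.

```go
package main

import "fmt"

// Reason values as listed in the PR description. An empty reason marks a
// successful job.
const (
	ReasonBackoffLimitExceeded = "BackoffLimitExceeded"
	ReasonDeadlineExceeded     = "DeadlineExceeded"
	ReasonPodFailurePolicy     = "PodFailurePolicy"
)

// finishedKey is a hypothetical label set for job_finished_total.
// The Result values used below are assumed, not taken from the PR.
type finishedKey struct {
	Result string
	Reason string
}

// counterVec is a toy stand-in for a labeled counter: one integer per
// distinct label combination.
type counterVec[K comparable] map[K]int

// Inc increments the counter for the given label combination.
func (c counterVec[K]) Inc(k K) { c[k]++ }

func main() {
	jobFinished := counterVec[finishedKey]{}
	podFailuresHandled := counterVec[string]{} // keyed by action

	// One successful job and two failed jobs with different reasons.
	jobFinished.Inc(finishedKey{"succeeded", ""})
	jobFinished.Inc(finishedKey{"failed", ReasonBackoffLimitExceeded})
	jobFinished.Inc(finishedKey{"failed", ReasonPodFailurePolicy})

	// Three pod failures matched by pod failure policy rules.
	podFailuresHandled.Inc("FailJob")
	podFailuresHandled.Inc("Ignore")
	podFailuresHandled.Inc("Ignore")

	fmt.Println(jobFinished[finishedKey{"failed", ReasonPodFailurePolicy}])
	fmt.Println(podFailuresHandled["Ignore"])
}
```

Each distinct label combination gets its own time series, which is why successful jobs share a single series with an empty reason.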

Which issue(s) this PR fixes:

Tracking issue: Retriable and non-retriable Pod failures for Jobs enhancements#3329

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Extend the `job_finished_total` job metric with a new `reason` label, and introduce a new job metric that counts pod failures
handled by the pod failure policy, broken down by the action applied.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures

k8s-ci-robot · 2022-10-25T12:31:27Z

@mimowo: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mimowo · 2022-10-25T12:32:36Z

/sig apps

mimowo · 2022-10-25T12:33:10Z

/assign @alculquicondor

mimowo · 2022-10-26T11:48:34Z

/retest

alculquicondor · 2022-10-26T16:07:45Z

pkg/controller/job/metrics/metrics.go

@@ -84,6 +86,18 @@ var (
 		},


update the help text above to indicate that pods that fit the "ignore" rule are not counted.

pkg/controller/job/metrics/metrics.go

pkg/controller/job/job_controller.go

alculquicondor · 2022-10-27T13:21:39Z

LGTM, but I'd better give it another pass after KubeCon.

alculquicondor

/approve
/hold
for suggestions

alculquicondor · 2022-10-28T21:38:54Z

pkg/controller/job/pod_failure_policy_test.go

+			if tc.wantAction == nil {
+				if action != nil {
+					t.Errorf("Unexpected job failure policy action. Got: %q", *action)
+				}
+			} else {
+				if action == nil {
+					t.Errorf("Missing job failure policy action. want: %q", *tc.wantAction)
+				} else if *tc.wantAction != *action {
+					t.Errorf("Unexpected job failure policy action. want: %q. got: %q", *tc.wantAction, *action)
+				}
+			}


you can simplify all of this with cmp.Equal or cmp.Diff

Indeed, I used cmp.Diff for this and for the failure message check.
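
The simplification discussed in this thread can be sketched roughly as follows. The thread refers to cmp.Equal and cmp.Diff; to keep the example standard-library-only, this sketch swaps in `reflect.DeepEqual`, which also treats two nil pointers as equal and compares non-nil pointers by the values they point to. The `Action` type and the `checkAction` and `describe` helpers are hypothetical names for illustration, not the code that was merged.

```go
package main

import (
	"fmt"
	"reflect"
)

// Action is a hypothetical stand-in for the pod failure policy action type.
type Action string

// describe renders a possibly-nil action for use in error messages.
func describe(a *Action) string {
	if a == nil {
		return "<nil>"
	}
	return string(*a)
}

// checkAction returns an empty string when want and got match, and a
// human-readable message otherwise. A single deep comparison replaces the
// nested nil checks from the original test.
func checkAction(want, got *Action) string {
	if reflect.DeepEqual(want, got) {
		return ""
	}
	return fmt.Sprintf("unexpected job failure policy action: want %s, got %s",
		describe(want), describe(got))
}

func main() {
	failJob, ignore := Action("FailJob"), Action("Ignore")
	fmt.Println(checkAction(&failJob, &failJob) == "")
	fmt.Println(checkAction(&failJob, &ignore))
	fmt.Println(checkAction(nil, &ignore))
}
```

In a real test the returned message would be passed to `t.Errorf` when it is non-empty; cmp.Diff plays the same role while also producing the formatted difference itself.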

alculquicondor · 2022-10-28T21:44:15Z

test/integration/job/job_test.go

-		job                       *batchv1.Job
-		wantJobFinishedNumMetric  metricLabelsWithValue
-		wantJobPodsFinishedMetric metricLabelsWithValue
+		job                                      *batchv1.Job


I don't like the direction that TestMetrics is going. Instead, we could test metrics per feature in their corresponding tests.

For example, for the metric you are adding, you could introduce the check in TestJobPodFailurePolicy.

If it's too late to change it, we can merge as-is and fix it before the test freeze.

Agreed, I have extracted the checks against the changed metrics into dedicated tests.
Also, to prevent overloading the TestMetrics test in the future, I suggest renaming it to TestMetricsOnSuccesses. PTAL.

alculquicondor · 2022-10-28T21:45:12Z

test/integration/job/job_test.go

+				}
+			}
+
+			if tc.wantJobFailure {


This is already tested in other tests.

The purpose of this field was to select the appropriate waiting method rather than to do a check. Anyway, the field is gone after refactoring to extract the checks to dedicated tests.

alculquicondor · 2022-10-31T13:39:51Z

test/integration/job/job_test.go

+	}
+	job_index := 0 // job index to avoid collisions between job names created by different test cases
+	for name, tc := range testCases {
+		for _, wFinalizers := range []bool{false, true} {


Just test with true, we no longer care about non-finalizers

alculquicondor · 2022-10-31T13:41:21Z

test/integration/job/job_test.go

@@ -149,6 +149,147 @@ func TestMetrics(t *testing.T) {
 	}
 }

+func TestJobFinishedNumReasonMetric(t *testing.T) {


can you cover this in TestJobPodFailurePolicy?

Technically I can, but it seems reasonable to keep them decoupled, similar to how we decided to split TestMetrics into smaller pieces. For example, TestJobFinishedNumReasonMetric has a scenario "non-indexed job; failed" which expects the BackoffLimitExceeded reason. It does not seem to belong conceptually in TestJobPodFailurePolicy.


alculquicondor · 2022-10-31T14:14:45Z

/lgtm
/approve
/label tide/merge-method-squash

k8s-ci-robot · 2022-10-31T14:15:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/job/OWNERS~~ [alculquicondor]
~~test/integration/job/OWNERS~~ [alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mimowo · 2022-10-31T15:20:31Z

@alculquicondor please unhold

alculquicondor · 2022-10-31T15:49:09Z

Feel free to unhold when the hold was due to a nit
/hold cancel

* Extend job metrics
* Refactor TestMetrics to extract its checks into dedicated tests per feature

k8s-ci-robot added the needs-priority label Oct 25, 2022

mimowo force-pushed the handling-pod-failures-metrics branch from c6f96bc to 2e07ae2 October 25, 2022 12:31

k8s-ci-robot added the area/test, sig/apps, and sig/testing labels and removed the do-not-merge/needs-sig label Oct 25, 2022

k8s-ci-robot requested review from alculquicondor and soltysh October 25, 2022 12:32

k8s-ci-robot assigned alculquicondor Oct 25, 2022

mimowo changed the title ~~Extend metrics with the new fields~~ Extend metrics with the new labels Oct 25, 2022

mimowo force-pushed the handling-pod-failures-metrics branch 2 times, most recently from 3cf440a to 1f1cb97 October 26, 2022 08:38

alculquicondor reviewed Oct 26, 2022


mimowo mentioned this pull request Oct 27, 2022

Enable the "Retriable and non-retriable pod failures for jobs" feature into beta #113360

Merged

alculquicondor mentioned this pull request Oct 27, 2022

Retriable and non-retriable Pod failures for Jobs kubernetes/enhancements#3329

Open

8 tasks

alculquicondor reviewed Oct 28, 2022


k8s-ci-robot added the do-not-merge/hold and approved labels Oct 28, 2022

Extend job metrics

f249547

mimowo force-pushed the handling-pod-failures-metrics branch from 313b992 to 311b3dc October 31, 2022 08:25

k8s-ci-robot added the size/L label and removed the size/XL label Oct 31, 2022

mimowo force-pushed the handling-pod-failures-metrics branch from 311b3dc to fb119f3 October 31, 2022 08:36

alculquicondor reviewed Oct 31, 2022


Refactor TestMetrics to extract its checks into dedicated tests per f…

d7bd920

…eature

mimowo force-pushed the handling-pod-failures-metrics branch from fb119f3 to d7bd920 October 31, 2022 14:03

k8s-ci-robot added the tide/merge-method-squash and lgtm labels Oct 31, 2022

k8s-ci-robot removed the do-not-merge/hold label Oct 31, 2022

k8s-ci-robot merged commit 3628532 into kubernetes:master Oct 31, 2022

k8s-ci-robot added this to the v1.26 milestone Oct 31, 2022

mimowo mentioned this pull request Nov 10, 2022

Self-nominate mimowo as a reviewer for pkg/controller/job & test/integration/job packages #113196

Merged

7 tasks

jaehnri pushed a commit to jaehnri/kubernetes that referenced this pull request Jan 3, 2023

Extend metrics with the new labels (kubernetes#113324)

f05af74

* Extend job metrics
* Refactor TestMetrics to extract its checks into dedicated tests per feature

mimowo deleted the handling-pod-failures-metrics branch March 18, 2023 18:43
