New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend metrics with the new labels #113324
Extend metrics with the new labels #113324
Conversation
@mimowo: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
c6f96bc
to
2e07ae2
Compare
/sig apps |
/assign @alculquicondor |
3cf440a
to
1f1cb97
Compare
/retest |
@@ -84,6 +86,18 @@ var ( | |||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update the help text above to indicate that pods that fit the "ignore" rule are not counted.
LGTM, but I better give it another pass after kubecon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
/hold
for suggestions
if tc.wantAction == nil { | ||
if action != nil { | ||
t.Errorf("Unexpected job failure polic action. Got: %q", *action) | ||
} | ||
} else { | ||
if action == nil { | ||
t.Errorf("Missing job failure policy action. want: %q", *tc.wantAction) | ||
} else if *tc.wantAction != *action { | ||
t.Errorf("Unexpected job failure policy action. want: %v. got: %v", tc.wantAction, action) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can simplify all of this with cmp.Equal
or cmp.Diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, used cmp.Diff for this and the check for the failure message
test/integration/job/job_test.go
Outdated
job *batchv1.Job | ||
wantJobFinishedNumMetric metricLabelsWithValue | ||
wantJobPodsFinishedMetric metricLabelsWithValue | ||
job *batchv1.Job |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the direction that TestMetrics
is going. Instead, we could test metrics per feature in their corresponding tests.
For example, for the metric you are adding, you could introduce the check in TestJobPodFailurePolicy
.
If it's too late to change it, we can merge as-is and fix it before the test freeze.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I have extracted the checks against the changed metrics to dedicated tests.
Also, in order to prevent overloading the TestMetrics test in the future I suggest renaming it as TestMetricsOnSuccesses
. PTAL.
test/integration/job/job_test.go
Outdated
} | ||
} | ||
|
||
if tc.wantJobFailure { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already tested in other tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this field was to select the appropriate waiting method rather than to do a check. Anyway, the field is gone after refactoring to extract the checks to dedicated tests.
313b992
to
311b3dc
Compare
311b3dc
to
fb119f3
Compare
test/integration/job/job_test.go
Outdated
} | ||
job_index := 0 // job index to avoid collisions between job names created by different test cases | ||
for name, tc := range testCases { | ||
for _, wFinalizers := range []bool{false, true} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just test with true, we no longer care about non-finalizers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -149,6 +149,147 @@ func TestMetrics(t *testing.T) { | |||
} | |||
} | |||
|
|||
func TestJobFinishedNumReasonMetric(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you cover this in TestJobPodFailurePolicy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically I can, but it seems reasonable to keep decoupled - similarly as we decided to split TestMetrics into smaller pieces. For example, the TestJobFinishedNumReasonMetric
tests has a scenario "non-indexed job; failed" which expects BackoffLimitExceeded reason. It does not seem to belong conceptually to TestJobPodFailurePolicy
fb119f3
to
d7bd920
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, mimowo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@alculquicondor please unhold |
Feel free to unhold when the hold was due to a nit |
* Extend job metrics * Refactor TestMetrics to extract its checks into dedicated tests per feature
What type of PR is this?
/kind feature
What this PR does / why we need it:
It extends the job metrics:
job_finished_total
- by a new labelreason
which can take values: "BackoffLimitExceeded", "DeadlineExceeded", "PodFailurePolicy", "". The metric corresponding to the new label counts the number of jobs failed with a given result. Thereason
field is left empty for successful jobs.pod_failures_handled_by_pod_failure_policy_total
- new counter metric with theaction
which can take values: "FailJob", "Ignore", "Count". The metric counts the number of pods handled by a given pod failure policy action.Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: