New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return non-zero code on logexporter failures #95563
Conversation
@tosi3k: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I would like to hide it behind some flag so as to not enable it by default everywhere. Separately I think it would be better to have that threshold configurable. Maybe we could merge above and have a flag that default to disabling, but control the threshold? Percent (or fractional ratio) is better than node count as the flag will land in some presets. |
2b80f2a
to
c416393
Compare
Added Edit: renamed it to |
c416393
to
c1cbc5d
Compare
/retest |
c1cbc5d
to
0d8d847
Compare
/retest |
0d8d847
to
0836220
Compare
cluster/log-dump/log-dump.sh
Outdated
# TODO: Get rid of all the sourcing of bash dependencies eventually. | ||
function setup() { | ||
LOG_DUMP_EXIT_CODE=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would lower case and move variable initialization outside of the function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cluster/log-dump/log-dump.sh
Outdated
@@ -63,6 +63,13 @@ readonly windows_node_otherfiles="C:\\Windows\\MEMORY.dmp" | |||
# file descriptors for large clusters. | |||
readonly max_dump_processes=25 | |||
|
|||
# Default exit code for the log dump process. | |||
log_dump_exit_code=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as this is rather a marker that log exporting failed, than an exit code that is always used, could you reflect that in variable name and use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Done.
cluster/log-dump/log-dump.sh
Outdated
@@ -675,6 +689,9 @@ function main() { | |||
fi | |||
|
|||
detect_node_failures | |||
if [[ ${log_dump_expected_success_percentage} -gt 0 ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after switch from error code to flag indicating failure:
if log exporting failed; then
if [[ ${log_dump_expected_success_percentage} -gt 0 ]]; then
return 1
fi
fi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or combine both in one condition if more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combined both conditions into a single one. Done.
0836220
to
a1de410
Compare
cluster/log-dump/log-dump.sh
Outdated
@@ -675,6 +690,9 @@ function main() { | |||
fi | |||
|
|||
detect_node_failures | |||
if [[ ${log_dump_expected_success_percentage} -gt 0 && ${logexporter_failed} -ne 0 ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would reverse conditions to be more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tbh, the current order sounds more natural to me (if failing_on_unsuccessful_log_dumping_enabled && logexporter_failed
) but I might be slightly biased here because I wrote that logic. Nevertheless, I'll switch the conditions if their sequence sounds more natural.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/lgtm But please verify if it is working in some manual test |
a1de410
to
f698b00
Compare
Tested it using a broken
When /hold cancel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small comment - other than that LGTM.
cluster/log-dump/log-dump.sh
Outdated
@@ -611,6 +621,11 @@ function dump_nodes_with_logexporter() { | |||
done | |||
fi | |||
|
|||
# If less than a certain ratio of the nodes got logexported, report an error. | |||
if [[ $((${#NODE_NAMES[@]} - ${#failed_nodes[@]})) -lt $((${#NODE_NAMES[@]} * log_dump_expected_success_percentage / 100)) ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dividing is always tricky - let's instead do:
100 * (node_names - failed_node) < node_names * percentage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dividing is always tricky
Especially tricky in bash which doesn't know anything about floating-point numbers ;) thanks for the remark, done.
f698b00
to
e42a8db
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tosi3k, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test pull-kubernetes-e2e-gce-ubuntu-containerd |
/retest Review the full test history for this PR. Silence the bot with an |
6 similar comments
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
/retest Review the full test history for this PR. Silence the bot with an |
What type of PR is this?
/kind bug
What this PR does / why we need it:
When
logexporter
daemonset fails to be created, it is not reported in our CI tests as a failure.This will make the
log-dump.sh
process return a non-zero code wheneverlogexporter
daemonset creation fails or less than half of the nodes got logexported successfully.Which issue(s) this PR fixes:
No issue is opened for that.
Special notes for your reviewer:
/sig scalability
/assign @jkaniuk @wojtek-t
Does this PR introduce a user-facing change?: