New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return all scheduler predicate failures instead of the first one #86022
Return all scheduler predicate failures instead of the first one #86022
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Huang-Wei The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
6578038
to
a73e0f2
Compare
code Code | ||
message string | ||
code Code | ||
failures []error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly think this is not proper and confusing. I don't really think we need it either, why not just concatenate the reasons in PredicateResultToFrameworkStatus
into a single string that we store in message? This is basically what utilerrors.NewAggregate
returns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For each predicate/plugin, it may have multiple internal (atomic) failure reasons. Suppose the number is n
, in theory, there could be 2^n
possible aggregated failure. Take PodFitsResources
for example, there could be insufficient resource on cpu, memory, ephemeral storage, or extended resource, if we just concatenate the aggregated failures into a string, that string wouldn't be merged properly well with the failures in other nodes into the final histogram message:
For an aggregated string case:
- node1: <insufficient cpu, memory, ephemeral stroage>
- node2: <insufficient cpu>
- node3: <insufficient memory>
The message users can see will be:
0/3 nodes are available: 1 Insufficient cpu, memory, ephemeral stroage, 1 Insufficient cpu, 1 Insufficient memory
NOTE: the message length showed here becomes O(2n) instead of O(n)..
While with this PR, and also which is consistent with the old behavior:
- node1: <insufficient cpu>, <insufficient memory>, <insufficient ephemeral stroage>
- node2: <insufficient cpu>
- node3: <insufficient memory>
The message users can see will be:
0/3 nodes are available: 2 Insufficient cpu, 2 Insufficient memory, 1 Insufficient ephemeral stroage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so the concern is the number of combinations in the histogram.
Since "error" is not the right type to explain all Status codes (e.g., Unschedulable is not an error), I suggest we have a list of strings instead, we can name it for example reasons []strings
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG.
code Code | ||
message string | ||
code Code | ||
failures []error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so the concern is the number of combinations in the histogram.
Since "error" is not the right type to explain all Status codes (e.g., Unschedulable is not an error), I suggest we have a list of strings instead, we can name it for example reasons []strings
?
return "" | ||
} | ||
return s.message | ||
return s.AsError().Error() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we need to return the list of reasons/errors so that we can iterate over them when generating the histogram: https://github.com/kubernetes/kubernetes/blob/a73e0f2112d285e4872037428dff8dda55229039/pkg/scheduler/core/generic_scheduler.go#L99
I think we should have two functions, one is Message
which concatenates all reasons, and one named Reasons
which return the slice. Reasons
is used when generating the histogram, and ```Message`` used for logs and all other purposes.
} | ||
|
||
// NewStatus makes a Status out of the given arguments and returns its pointer. | ||
func NewStatus(code Code, msg string) *Status { | ||
return &Status{ | ||
code: code, | ||
message: msg, | ||
code: code, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can update the function like this: func NewStatus(code Code, reasons ...string) *Status
, and so we don't need NewStatusWithFailures
a73e0f2
to
4003f4c
Compare
/hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few nits, please also squash.
@@ -105,13 +105,14 @@ var unresolvablePredicateFailureErrors = map[PredicateFailureReason]struct{}{ | |||
|
|||
// UnresolvablePredicateExists checks if there is at least one unresolvable predicate failure reason, if true | |||
// returns the first one in the list. | |||
func UnresolvablePredicateExists(reasons []PredicateFailureReason) PredicateFailureReason { | |||
func UnresolvablePredicateExists(reasons []PredicateFailureReason) []string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, how about we just return boolean?
return framework.NewStatus(framework.UnschedulableAndUnresolvable, r.GetReason()) | ||
var failureReasons []string | ||
if failureReasons = predicates.UnresolvablePredicateExists(reasons); len(failureReasons) != 0 { | ||
return framework.NewStatus(framework.UnschedulableAndUnresolvable, failureReasons...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if UnresolvablePredicateExists
returns boolean, then we can do this:
code := framework.Unschedulable
if predicates.UnresolvablePredicateExists(reasons) {
code := framework.UnschedulableAndUnresolvable
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG. The only side effect is that for some predicate/fitler plugin, the chances are some reasons are UnschedulableAndUnresolvable
and some are just Unschedulable
, then here we're returning all reasons instead of only UnschedulableAndUnresolvable
reasons. But it should be good overall.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for _, reason := range reasons { | ||
failureReasons = append(failureReasons, reason.GetReason()) | ||
} | ||
return framework.NewStatus(framework.Unschedulable, failureReasons...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here: return framework.NewStatus(code, failureReasons...)
func (s *Status) Message() string { | ||
if s == nil { | ||
if s == nil || s.IsSuccess() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can keep it as is, we don't need to check for IsSuccess
func (s *Status) AsError() error { | ||
if s.IsSuccess() { | ||
msg := s.Message() | ||
if msg == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditt, we can keep this as is
4003f4c
to
3dfdab6
Compare
3dfdab6
to
a136108
Compare
/retest |
/lgtm Thanks Wei! |
Thanks @ahg-g for reviewing! |
/hold cancel |
What type of PR is this?
/kind bug
/sig scheduling
What this PR does / why we need it:
Scheduled used to report all failure reasons upon a predicate failure. For example, if a Pod requests excessive cpu and memory, running
kubectl describe pod <pod name>
will get message:However, in 1.17, we introduced changes which only returns the first failure reason. For the above example, it reports:
It's not appropriate since:
Which issue(s) this PR fixes:
Part of #85918.
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/cc @ahg-g