Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: improve logs displayed when scheduling fails #317

Merged

Conversation

tzneal
Copy link
Contributor

@tzneal tzneal commented May 3, 2023

Fixes #

Description
improve logs displayed when scheduling fails
How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tzneal
Copy link
Contributor Author

tzneal commented May 3, 2023

New sample error messages:

  logger.go:130: 2023-05-02T21:34:15.278-0500	ERROR	scheduling/scheduler.go:181	Could not schedule pod, incompatible with provisioner "raccoonpatch-2255-qybye8gsgg", no instance type satisfied resources {"pods":"1"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/provisioner-name In [raccoonpatch-2255-qybye8gsgg], topology.kubernetes.io/zone In [non-existent-zone], testing.karpenter.sh/test-id In [unspecified] (no instance type met the scheduling requirements or had a required offering)	{"pod": "default/snoutkiwi-2257-3mshp0o3xd"}
logger.go:130: 2023-05-02T21:34:23.271-0500	ERROR	scheduling/scheduler.go:181	Could not schedule pod, incompatible with provisioner "tigercotton-2893-ipk2xrqnuh", no instance type satisfied resources {"pods":"1"} and requirements topology.kubernetes.io/zone In [test-zone-3], karpenter.sh/capacity-type In [spot], testing.karpenter.sh/test-id In [unspecified], karpenter.sh/provisioner-name In [tigercotton-2893-ipk2xrqnuh] (no instance type has the required offering)	{"pod": "default/howlershard-2897-ap1g0csuzi"}
logger.go:130: 2023-05-02T21:34:48.181-0500	ERROR	scheduling/scheduler.go:181	Could not schedule pod, incompatible with provisioner "sharkcrocus-4397-c1uqj7foaf", no instance type satisfied resources {"cpu":"14","pods":"1"} and requirements kubernetes.io/arch In [amd64], testing.karpenter.sh/test-id In [unspecified], karpenter.sh/provisioner-name In [sharkcrocus-4397-c1uqj7foaf] (no instance type which had enough resources and the required offering met the scheduling requirements)	{"pod": "default/finchip-4399-3fplmf5fem"}

@tzneal tzneal force-pushed the improve-scheduling-error-message branch from 0172653 to 4089837 Compare May 3, 2023 02:39
@tzneal tzneal force-pushed the improve-scheduling-error-message branch from 4089837 to 0a22aea Compare May 3, 2023 02:44
@coveralls
Copy link

coveralls commented May 3, 2023

Pull Request Test Coverage Report for Build 4917124948

  • 61 of 77 (79.22%) changed or added relevant lines in 1 file are covered.
  • 7 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.0007%) to 81.248%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controllers/provisioning/scheduling/machine.go 61 77 79.22%
Files with Coverage Reduction New Missed Lines %
pkg/controllers/node/controller.go 7 70.59%
Totals Coverage Status
Change from base Build 4899965012: -0.0007%
Covered Lines: 6876
Relevant Lines: 8463

💛 - Coveralls

@tzneal
Copy link
Contributor Author

tzneal commented May 5, 2023

Negligible difference in scheduling performance:

name               old time/op   new time/op   delta
Scheduling1-12       364µs ± 1%    367µs ± 1%  +1.06%  (p=0.032 n=5+5)
Scheduling50-12     18.6ms ± 2%   19.0ms ± 5%    ~     (p=0.151 n=5+5)
Scheduling100-12    37.0ms ± 2%   37.3ms ± 3%    ~     (p=0.690 n=5+5)
Scheduling500-12     181ms ± 4%    180ms ± 5%    ~     (p=1.000 n=5+5)
Scheduling1000-12    540ms ±74%    388ms ±12%    ~     (p=0.841 n=5+5)
Scheduling2000-12    743ms ± 9%    784ms ± 5%    ~     (p=0.151 n=5+5)
Scheduling5000-12    2.01s ± 3%    1.93s ± 5%    ~     (p=0.310 n=5+5)

@tzneal tzneal marked this pull request as ready for review May 5, 2023 13:20
@tzneal tzneal requested a review from a team as a code owner May 5, 2023 13:20
@tzneal tzneal requested a review from ellistarn May 5, 2023 13:20
Copy link
Contributor

@ellistarn ellistarn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extremely cool

@tzneal tzneal merged commit 6a6e5fe into kubernetes-sigs:main May 8, 2023
4 checks passed
@tzneal tzneal deleted the improve-scheduling-error-message branch May 8, 2023 18:20
}
if r.requirementsAndOffering {
return "no instance type which met the scheduling requirements and the required offering had the required resources"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this evolution of log messaging is definitely an improvement, and gives more visibility into what decisions what karpenter + k8s scheduler is doing.

Would it be more readable if we simply returned a more transparent log message w/ the three dimensions and their true/false values? I like the human-centric structure of a sentence, but formulating the combinatoric possibilities of failures here as discrete sentences is hard to get right from a clarity of language point of view.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I merged it before I saw your comment. I did it this way to try to be more informative, with three bools you are reporting these 7 possible errors:

R / F / O
0 0 0 - success
0 0 1 - "no instance type has the required offering"
0 1 0 - "no instance type has enough resources"
0 1 1 - "no instance type had enough resources or had a required offering"
1 0 0 - "no instance type met all requirements"
1 0 1 - "no instance type met the scheduling requirements or had a required offering"
1 1 0 - "no instance type met the scheduling requirements or had enough resources"
1 1 1 - "no instance type met the scheduling requirements or had enough resources or had a required offering"

It currently reports four additional errors:

  • "no instance type which met the scheduling requirements and had enough resources, had a required offering"
  • "no instance type which had enough resources and the required offering met the scheduling requirements"
  • "no instance type which met the scheduling requirements and the required offering had the required resources"
  • "no instance type met the requirements/resources/offering tuple"

I was trying to err on the side of making the currently challenging task of figuring out why your pod won't schedule easier :)

@jackfrancis
Copy link
Contributor

(I see this just merged, but I'll keep asking questions :))

Are these scheduling failures during scale out considerations? Or are these orthogonal to scale out/in and evaluated based upon nodes and their capabilities that are actually running in the cluster at the time?

@ellistarn
Copy link
Contributor

Happy to review a followon!

@tzneal
Copy link
Contributor Author

tzneal commented May 8, 2023

Are these scheduling failures during scale out considerations? Or are these orthogonal to scale out/in and evaluated based upon nodes and their capabilities that are actually running in the cluster at the time?

They occur during both provisioning and consolidation, but they're not logged during consolidation. The same scheduling code is used for both, consolidation is essentially just:

  1. Simulate the effect of one or more nodes being deleted by re-scheduling their pods
  2. It the result cheaper?
  3. If yes, do that, otherwise try again on a different candidate(s)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants