chore: improve logs displayed when scheduling fails #317

tzneal · 2023-05-03T02:34:14Z

Fixes #

Description
improve logs displayed when scheduling fails
How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

tzneal · 2023-05-03T02:36:39Z

New sample error messages:

  logger.go:130: 2023-05-02T21:34:15.278-0500	ERROR	scheduling/scheduler.go:181	Could not schedule pod, incompatible with provisioner "raccoonpatch-2255-qybye8gsgg", no instance type satisfied resources {"pods":"1"} and requirements karpenter.sh/capacity-type In [on-demand spot], karpenter.sh/provisioner-name In [raccoonpatch-2255-qybye8gsgg], topology.kubernetes.io/zone In [non-existent-zone], testing.karpenter.sh/test-id In [unspecified] (no instance type met the scheduling requirements or had a required offering)	{"pod": "default/snoutkiwi-2257-3mshp0o3xd"}

logger.go:130: 2023-05-02T21:34:23.271-0500	ERROR	scheduling/scheduler.go:181	Could not schedule pod, incompatible with provisioner "tigercotton-2893-ipk2xrqnuh", no instance type satisfied resources {"pods":"1"} and requirements topology.kubernetes.io/zone In [test-zone-3], karpenter.sh/capacity-type In [spot], testing.karpenter.sh/test-id In [unspecified], karpenter.sh/provisioner-name In [tigercotton-2893-ipk2xrqnuh] (no instance type has the required offering)	{"pod": "default/howlershard-2897-ap1g0csuzi"}

logger.go:130: 2023-05-02T21:34:48.181-0500	ERROR	scheduling/scheduler.go:181	Could not schedule pod, incompatible with provisioner "sharkcrocus-4397-c1uqj7foaf", no instance type satisfied resources {"cpu":"14","pods":"1"} and requirements kubernetes.io/arch In [amd64], testing.karpenter.sh/test-id In [unspecified], karpenter.sh/provisioner-name In [sharkcrocus-4397-c1uqj7foaf] (no instance type which had enough resources and the required offering met the scheduling requirements)	{"pod": "default/finchip-4399-3fplmf5fem"}

coveralls · 2023-05-03T02:56:19Z

Pull Request Test Coverage Report for Build 4917124948

61 of 77 (79.22%) changed or added relevant lines in 1 file are covered.
7 unchanged lines in 1 file lost coverage.
Overall coverage decreased (-0.0007%) to 81.248%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/controllers/provisioning/scheduling/machine.go	61	77	79.22%

Files with Coverage Reduction	New Missed Lines	%
pkg/controllers/node/controller.go	7	70.59%

Totals
Change from base Build 4899965012:	-0.0007%
Covered Lines:	6876
Relevant Lines:	8463

💛 - Coveralls

tzneal · 2023-05-05T13:20:34Z

Negligible difference in scheduling performance:

name               old time/op   new time/op   delta
Scheduling1-12       364µs ± 1%    367µs ± 1%  +1.06%  (p=0.032 n=5+5)
Scheduling50-12     18.6ms ± 2%   19.0ms ± 5%    ~     (p=0.151 n=5+5)
Scheduling100-12    37.0ms ± 2%   37.3ms ± 3%    ~     (p=0.690 n=5+5)
Scheduling500-12     181ms ± 4%    180ms ± 5%    ~     (p=1.000 n=5+5)
Scheduling1000-12    540ms ±74%    388ms ±12%    ~     (p=0.841 n=5+5)
Scheduling2000-12    743ms ± 9%    784ms ± 5%    ~     (p=0.151 n=5+5)
Scheduling5000-12    2.01s ± 3%    1.93s ± 5%    ~     (p=0.310 n=5+5)

ellistarn

Extremely cool

jackfrancis · 2023-05-08T18:39:18Z

pkg/controllers/provisioning/scheduling/machine.go

+	}
+	if r.requirementsAndOffering {
+		return "no instance type which met the scheduling requirements and the required offering had the required resources"
+	}


I think this evolution of log messaging is definitely an improvement, and gives more visibility into what decisions what karpenter + k8s scheduler is doing.

Would it be more readable if we simply returned a more transparent log message w/ the three dimensions and their true/false values? I like the human-centric structure of a sentence, but formulating the combinatoric possibilities of failures here as discrete sentences is hard to get right from a clarity of language point of view.

Sorry, I merged it before I saw your comment. I did it this way to try to be more informative, with three bools you are reporting these 7 possible errors:

R / F / O 0 0 0 - success 0 0 1 - "no instance type has the required offering" 0 1 0 - "no instance type has enough resources" 0 1 1 - "no instance type had enough resources or had a required offering" 1 0 0 - "no instance type met all requirements" 1 0 1 - "no instance type met the scheduling requirements or had a required offering" 1 1 0 - "no instance type met the scheduling requirements or had enough resources" 1 1 1 - "no instance type met the scheduling requirements or had enough resources or had a required offering"

It currently reports four additional errors:

"no instance type which met the scheduling requirements and had enough resources, had a required offering"

"no instance type which had enough resources and the required offering met the scheduling requirements"

"no instance type which met the scheduling requirements and the required offering had the required resources"

"no instance type met the requirements/resources/offering tuple"

I was trying to err on the side of making the currently challenging task of figuring out why your pod won't schedule easier :)

jackfrancis · 2023-05-08T18:40:52Z

(I see this just merged, but I'll keep asking questions :))

Are these scheduling failures during scale out considerations? Or are these orthogonal to scale out/in and evaluated based upon nodes and their capabilities that are actually running in the cluster at the time?

ellistarn · 2023-05-08T18:49:31Z

Happy to review a followon!

tzneal · 2023-05-08T18:54:45Z

Are these scheduling failures during scale out considerations? Or are these orthogonal to scale out/in and evaluated based upon nodes and their capabilities that are actually running in the cluster at the time?

They occur during both provisioning and consolidation, but they're not logged during consolidation. The same scheduling code is used for both, consolidation is essentially just:

Simulate the effect of one or more nodes being deleted by re-scheduling their pods
It the result cheaper?
If yes, do that, otherwise try again on a different candidate(s)

)

tzneal force-pushed the improve-scheduling-error-message branch from 0172653 to 4089837 Compare May 3, 2023 02:39

chore: improve logs displayed when scheduling fails

0a22aea

tzneal force-pushed the improve-scheduling-error-message branch from 4089837 to 0a22aea Compare May 3, 2023 02:44

tzneal marked this pull request as ready for review May 5, 2023 13:20

tzneal requested a review from a team as a code owner May 5, 2023 13:20

tzneal requested a review from ellistarn May 5, 2023 13:20

Merge branch 'main' into improve-scheduling-error-message

f4a658f

ellistarn approved these changes May 8, 2023

View reviewed changes

tzneal merged commit 6a6e5fe into kubernetes-sigs:main May 8, 2023
4 checks passed

tzneal deleted the improve-scheduling-error-message branch May 8, 2023 18:20

jackfrancis reviewed May 8, 2023

View reviewed changes

bwagner5 pushed a commit to bwagner5/karpenter-core that referenced this pull request May 16, 2023

chore: improve logs displayed when scheduling fails (kubernetes-sigs#317

3d73bf3

)

bwagner5 mentioned this pull request May 16, 2023

Release v0.27.4 #339

Merged

bwagner5 pushed a commit that referenced this pull request May 16, 2023

chore: improve logs displayed when scheduling fails (#317)

f18b8e4

njtran mentioned this pull request May 17, 2023

improve "X out of Y instance types were excluded because they would breach provisioner limits" #744

Closed

njtran mentioned this pull request May 31, 2023

Better log message for quota-related failures aws/karpenter-provider-aws#3267

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: improve logs displayed when scheduling fails #317

chore: improve logs displayed when scheduling fails #317

tzneal commented May 3, 2023

tzneal commented May 3, 2023

coveralls commented May 3, 2023 •

edited

tzneal commented May 5, 2023

ellistarn left a comment

jackfrancis May 8, 2023

tzneal May 8, 2023

jackfrancis commented May 8, 2023

ellistarn commented May 8, 2023

tzneal commented May 8, 2023

chore: improve logs displayed when scheduling fails #317

chore: improve logs displayed when scheduling fails #317

Conversation

tzneal commented May 3, 2023

tzneal commented May 3, 2023

coveralls commented May 3, 2023 • edited

Pull Request Test Coverage Report for Build 4917124948

💛 - Coveralls

tzneal commented May 5, 2023

ellistarn left a comment

Choose a reason for hiding this comment

jackfrancis May 8, 2023

Choose a reason for hiding this comment

tzneal May 8, 2023

Choose a reason for hiding this comment

jackfrancis commented May 8, 2023

ellistarn commented May 8, 2023

tzneal commented May 8, 2023

coveralls commented May 3, 2023 •

edited