
Do contextual logging for scheduler #91633

Open
damemi opened this issue Jun 1, 2020 · 113 comments
Labels: area/logging · help wanted · kind/cleanup · kind/feature · sig/instrumentation · sig/scheduling · triage/accepted · wg/structured-logging

Comments

@damemi (Contributor) commented Jun 1, 2020

The kube-scheduler is currently difficult to debug through its logs. Logging is inconsistent and sparse, and critically helpful information is in some places logged at too high a verbosity to be practically accessible (for example node scores, which are logged at V(10); at that level the logs are flooded with low-level information too quickly to even grep).

What would you like to be added:
Standardize scheduler logging to useful, documented levels to improve debuggability and the diagnosis of scheduling anomalies. This information should be accessible at reasonable verbosity levels (i.e., v=2, 3, 4, ...) so that it is not drowned out by irrelevant low-level process and network logs, and it should be consistent across the scheduler.

Specifically, information that is necessary includes, but is not limited to:

  • Filter plugins
    • Per plugin/per node
  • Preemption attempts (/postfilter plugins)
  • Score plugins
    • Normalized results
      • Per plugin/per node
  • Nominated node result
    (Sub-lists indicate potential logging hierarchies; these are just some examples.)

In addition, these logs should be easy to trace through a scheduling cycle. That is, all logs related to a specific scheduling cycle should share a unique identifier so that the relevant information can be gathered efficiently with common search tools like grep.
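A minimal sketch of what this could look like with klog's structured key/value logging (the `cycleID` key, the plugin name, and the score values are invented for illustration, not existing scheduler fields):

```go
package main

import (
	"github.com/google/uuid"
	"k8s.io/klog/v2"
)

// scheduleOne sketches one scheduling attempt. A fresh cycleID is attached
// to every log line, so `grep <cycleID>` recovers the whole cycle.
func scheduleOne(podNamespace, podName string) {
	cycleID := uuid.NewString()

	klog.V(2).InfoS("Attempting to schedule pod",
		"pod", klog.KRef(podNamespace, podName), "cycleID", cycleID)
	klog.V(4).InfoS("Plugin scored node",
		"pod", klog.KRef(podNamespace, podName), "cycleID", cycleID,
		"plugin", "NodeResourcesFit", "node", "node-1", "score", 87)
}

func main() {
	scheduleOne("default", "nginx")
}
```

With something like this in place, a single grep on the cycle ID pulls every line for one attempt out of an otherwise noisy log.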

Why is this needed:
The most common issues with the scheduler are questions about the scheduling cycle ("Why did/didn't my pod get scheduled here/there?"). While the scheduler does currently report FailedScheduling events, these can lack information (#91340, #91601) and provide no insight into successful scheduling that does not land on the expected node (e.g., scoring interference).

There is demand for the ability to easily debug the steps of the scheduling cycle, and the lack of that ability frequently confuses users. It would be very helpful to be able to say, for example, "If you set v=2, you'll be able to see each plugin's results."

/sig scheduling

@damemi added the kind/feature label Jun 1, 2020
@k8s-ci-robot added the sig/scheduling label Jun 1, 2020
@damemi (Contributor, Author) commented Jun 1, 2020

/kind cleanup

@k8s-ci-robot added the kind/cleanup label Jun 1, 2020
@damemi (Contributor, Author) commented Jun 1, 2020

cc @ahg-g @Huang-Wei @alculquicondor @ingvagabund @ravisantoshgudimetla, wdyt? To make sure our logging stays consistent, I think it would be helpful to have some kind of design to point to.

@ahg-g (Member) commented Jun 2, 2020

+1

I like the idea of associating the logs with an identifier. We should look at structured logging and see if what you are planning to do aligns with that proposal: kubernetes/enhancements#1602

@alculquicondor (Member)

+1 on structured logging.

@ingvagabund (Contributor)

+1
I wish we could have two logging streams, one of them dedicated to just scheduling decisions.

@Huang-Wei (Member)

+1.

In addition to structured logging (to get consistent identifiers), we should have a consistent way to define each log level's semantics. No need for a KEP, but it would be nice to document this in https://github.com/kubernetes/community/tree/master/contributors/devel/sig-scheduling.
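For example, documented semantics could look something like this (purely hypothetical level assignments, just to show the idea):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// Hypothetical level convention, only to illustrate documented semantics;
// these are not the scheduler's actual levels.
func logCycleOutcome(pod *corev1.Pod, nodeName string, score int64) {
	klog.V(2).InfoS("Pod scheduled", "pod", klog.KObj(pod), "node", nodeName)              // per-cycle outcome
	klog.V(5).InfoS("Node score", "pod", klog.KObj(pod), "node", nodeName, "score", score) // per-plugin/per-node detail
}
```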

@jesusha123 (Contributor)

/assign

@yuzhiquan (Member)

/cc
/cc @serathius

@serathius (Contributor)

It would definitely be good to have cooperation between SIG Instrumentation and SIG Scheduling to collect requirements for structured logging.
Defining log semantics (aka a log schema) is definitely on our plate (it was part of the original proposal but was postponed to focus on the migration).
As for separating log streams for scheduling decisions, it would be interesting to make more use of https://github.com/go-logr/logr (the generic logging interface we use underneath klog) features such as named loggers.
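For illustration, a named logger in plain logr could look like this (using the stdr reference implementation just to keep the sketch runnable; the logger name "decisions" is made up):

```go
package main

import (
	"log"
	"os"

	"github.com/go-logr/stdr"
)

func main() {
	base := stdr.New(log.New(os.Stdout, "", log.LstdFlags))

	// A named child logger: its output is tagged with "decisions", so a
	// dedicated stream could be split off by filtering on the logger name.
	decisions := base.WithName("decisions")
	decisions.Info("Selected node", "pod", "default/nginx", "node", "node-1")
}
```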

Adding logr authors to discussion.
/cc @thockin @DirectXMan12

@soltysh (Contributor) commented Jun 3, 2020

/assign

@jherrera123 I'd suggest syncing with @damemi on this, since, as far as I can tell, he opened this issue intending to work on it himself.

@damemi (Contributor, Author) commented Jun 3, 2020

Thanks for the discussion so far everyone. I wasn't totally aware of the structured logging KEP, but I think this does align well with it. I see that there are even sections of the scheduler that have already been identified as good spots to add structured logging.

Like @Huang-Wei said, in addition to logging identifiers it's important that we document consistent log levels so users can know exactly what they're getting at each level. I think this goes beyond the existing structured logging KEP (and is scheduler-specific), so this issue isn't quite a duplicate of the KEP.

And @jherrera123, I haven't started work on this yet (still gathering feedback here), but I won't turn down any offers to help :)

@jesusha123 (Contributor)

@damemi once we land on a design, I'd be happy to help. It looks like there's a lot to be done in the logging area.

@damemi (Contributor, Author) commented Aug 17, 2020

Sorry all, I haven't forgotten about this. I've read up some more on structured logging, and it doesn't look too bad to implement. I think the big thing (at least for the scheduler) would be something like the named loggers @serathius mentioned in #91633 (comment). Though I think we could get sufficient context with just a pod key in the structured logs, at least for now.
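For illustration, a minimal sketch of what "just a pod key in the structured logs" could look like (assumed names, not actual scheduler code):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// klog.KObj renders the pod as "namespace/name", so every line about this
// pod can be collected with, e.g., grep 'pod="default/nginx"'.
func logAttempt(pod *corev1.Pod) {
	klog.V(3).InfoS("Attempting to schedule pod", "pod", klog.KObj(pod))
	klog.V(3).InfoS("Unable to schedule pod, waiting", "pod", klog.KObj(pod))
}
```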

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 15, 2020
@damemi (Contributor, Author) commented Nov 16, 2020

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Nov 16, 2020
@alculquicondor (Member) commented Nov 16, 2020

@damemi could you make a list of what's needed to have kube-scheduler in a desirable state with regard to structured logging?

@damemi (Contributor, Author) commented Nov 16, 2020

> @damemi could you make a list of what's needed to have kube-scheduler in a desirable state with regard to structured logging?

Yeah, I'll work on putting together what we talked about here. The main improvement would be logging a unique key for each pod throughout its scheduling cycle, as mentioned above.
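One way to thread such a key through the whole cycle is a sketch like the following, using klog/v2's contextual-logging helpers (the function names here are invented):

```go
package main

import (
	"context"

	"k8s.io/klog/v2"
)

func scheduleOne(ctx context.Context) {
	// Attach the pod key once; every callee that logs via
	// klog.FromContext(ctx) inherits it automatically.
	logger := klog.FromContext(ctx).WithValues("pod", klog.KRef("default", "nginx"))
	ctx = klog.NewContext(ctx, logger)

	runFilterPlugins(ctx)
}

func runFilterPlugins(ctx context.Context) {
	// No pod argument needed just for logging; the key rides along in ctx.
	klog.FromContext(ctx).V(4).Info("Running filter plugins")
}

func main() {
	scheduleOne(context.Background())
}
```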

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 14, 2021
@gavinfish (Contributor)

/remove-lifecycle stale

@mengjiao-liu (Member) commented Mar 14, 2023

Ref #111155 (comment)

> Everything under pkg/scheduler/apis can be its own PR.
>
> The rest is a bit unclear. Maybe it's possible to have a PR for each of the folders in pkg/scheduler/framework/plugins. Followed by a PR that does the rest.

If we start this task, we need to figure out how to split the PRs. I'm not very familiar with the scheduler's directory structure, so I'll first split the PRs along the lines @alculquicondor suggested and then adjust based on the actual implementation.

Some packages' context updates depend on other packages, so the ones without dependencies will be done first.

✔️ means the PR has been merged and that item is complete.

| Component | PR | Status |
| --- | --- | --- |
| pkg/scheduler/apis/ | Updates needed for v1 were added in #115588; the v1beta2 and v1beta3 packages are already marked as deprecated, so no changes are needed there (#111155 (comment)) | ✔️ |
| pkg/scheduler/framework/plugins/defaultbinder/ | #116571 | ✔️ |
| pkg/scheduler/framework/plugins/interpodaffinity/ | #116635 | ✔️ |
| pkg/scheduler/framework/plugins/noderesources/ | #116748 | ✔️ |
| pkg/scheduler/framework/plugins/podtopologyspread/ | #116797 | ✔️ |
| pkg/scheduler/framework/plugins/volumezone/ | #116829 | ✔️ |
| pkg/scheduler/framework/preemption/, pkg/scheduler/framework/plugins/defaultpreemption/ | #116835 | ✔️ |
| pkg/scheduler/framework/runtime | #116842 | ✔️ |
| pkg/scheduler/framework/plugins/examples/stateful/ | #116885 | ✔️ |
| pkg/scheduler/framework/plugins/nodevolumelimits/ | #116884 | ✔️ |
| pkg/scheduler/framework/plugins/volumebinding/ | #116803 | ✔️ |
| cmd/kube-scheduler/app/server.go, pkg/scheduler/eventhandlers.go & eventhandlers_test.go, pkg/scheduler/internal, pkg/scheduler/schedule_one.go & scheduler_test.go & schedule_one_test.go, test/integration/daemonset/, other files... | #116849 | ✔️ |
| hack/logcheck.conf (until all are migrated) | #120933 | WIP |
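For reference, the migration pattern these PRs apply looks roughly like this (a simplified sketch, not an exact diff from any of the PRs):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// Before: free-standing klog call, no request context.
func fitsBefore(pod *corev1.Pod, nodeName string) {
	klog.V(5).InfoS("Pod fits on node", "pod", klog.KObj(pod), "node", nodeName)
}

// After: the logger comes from ctx and can already carry the pod and
// cycle key/value pairs that were attached higher up the call chain.
func fitsAfter(ctx context.Context, nodeName string) {
	logger := klog.FromContext(ctx)
	logger.V(5).Info("Pod fits on node", "node", nodeName)
}
```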

@pohly (Contributor) commented Mar 14, 2023

I would prefer to have one owner for this issue. All of the changes exist, it's just a matter of splitting up the large PR into smaller pieces and then staying on top of concurrent changes to the code.

@mengjiao-liu: if you feel that you can handle this, then I'd be happy to have you work on this and we won't need a second assignee.

@mengjiao-liu (Member) commented Mar 14, 2023

> I would prefer to have one owner for this issue. All of the changes exist, it's just a matter of splitting up the large PR into smaller pieces and then staying on top of concurrent changes to the code.
>
> @mengjiao-liu: if you feel that you can handle this, then I'd be happy to have you work on this and we won't need a second assignee.

Okay, I can handle this. I'll also rename the second column of the table from "Owner" to "PR".

@pohly (Contributor) commented Mar 17, 2023

/unassign
/assign @mengjiao-liu

@alculquicondor (Member)

what PR should I review first? :)

@mengjiao-liu (Member) commented Apr 12, 2023

> what PR should I review first? :)

@alculquicondor I have marked the PRs that have dependencies; the ones without listed dependencies can be reviewed. You can start with #116842, #116849, or some of the plugin PRs 😀.

To avoid having them reviewed prematurely, the PRs that depend on others are still in Draft status.

@mengjiao-liu (Member)

/sig instrumentation
/wg structured-logging
/area logging

@k8s-ci-robot added the sig/instrumentation and area/logging labels Jul 10, 2023
@k8s-ci-robot (Contributor)

@mengjiao-liu: Closing this issue.

In response to this:

> Work continues in #91633 (comment)
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mengjiao-liu (Member) commented Jul 10, 2023

/reopen

I closed the issue by mistake.

@k8s-ci-robot reopened this Jul 10, 2023
@k8s-ci-robot (Contributor)

@mengjiao-liu: Reopened this issue.

In response to this:

> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
