
Do contextual logging for scheduler #91633

Open
damemi opened this issue Jun 1, 2020 · 113 comments
Labels: area/logging · help wanted · kind/cleanup · kind/feature · sig/instrumentation · sig/scheduling · triage/accepted · wg/structured-logging

Comments

@damemi (Contributor) commented Jun 1, 2020

The kube-scheduler is currently difficult to debug through its logs. Logging is inconsistent and sparse, and critically helpful information is in some places logged at too high a verbosity to be practically accessible (for example node scores, which are logged at V(10); at that level the logs are flooded with low-level information too quickly to even grep).

What would you like to be added:
Standardize scheduler logging to useful, documented levels to improve debuggability and the diagnosis of scheduling anomalies. This information should be accessible at reasonable verbosity levels (i.e., v=2, 3, 4, ...) so that it is not drowned out by irrelevant low-level process and network logs, and it should be consistent across the scheduler.

Specifically, information that is necessary includes, but is not limited to:

  • Filter plugins
    • Per plugin/per node
  • Preemption attempts (/postfilter plugins)
  • Score plugins
    • Normalized results
      • Per plugin/per node
  • Nominated node result
    (Sub-lists indicate potential logging hierarchies; these are just some examples.)

In addition, these logs should be easy to trace through a scheduling cycle. That is, all logs related to a specific scheduling cycle should share a unique identifier so that the relevant information can be gathered efficiently with common search tools like grep.
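A minimal sketch of what this could look like with klog's structured key/value logging (the `cycleID` key, the plugin name, and the score values are invented for illustration, not existing scheduler fields):

```go
package main

import (
	"github.com/google/uuid"
	"k8s.io/klog/v2"
)

// scheduleOne sketches one scheduling attempt. A fresh cycleID is attached
// to every log line, so `grep <cycleID>` recovers the whole cycle.
func scheduleOne(podNamespace, podName string) {
	cycleID := uuid.NewString()

	klog.V(2).InfoS("Attempting to schedule pod",
		"pod", klog.KRef(podNamespace, podName), "cycleID", cycleID)
	klog.V(4).InfoS("Plugin scored node",
		"pod", klog.KRef(podNamespace, podName), "cycleID", cycleID,
		"plugin", "NodeResourcesFit", "node", "node-1", "score", 87)
}

func main() {
	scheduleOne("default", "nginx")
}
```

With something like this in place, a single grep on the cycle ID pulls every line for one attempt out of an otherwise noisy log.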

Why is this needed:
The most common issues with the scheduler are questions about the scheduling cycle ("Why did/didn't my pod get scheduled here/there?"). While the scheduler does currently report FailedScheduling events, these can lack information (#91340, #91601) and provide no insight into successful scheduling that does not land on the expected node (e.g., scoring interference).

There is demand for the ability to easily debug the steps of the scheduling cycle, and the lack of that ability frequently confuses users. It would be very helpful to be able to say, for example, "If you set v=2, you'll be able to see each plugin's results."

/sig scheduling

@damemi added the kind/feature label Jun 1, 2020
@k8s-ci-robot added the sig/scheduling label Jun 1, 2020
@damemi (Contributor, Author) commented Jun 1, 2020

/kind cleanup

@k8s-ci-robot added the kind/cleanup label Jun 1, 2020
@damemi (Contributor, Author) commented Jun 1, 2020

cc @ahg-g @Huang-Wei @alculquicondor @ingvagabund @ravisantoshgudimetla, wdyt? To make sure our logging stays consistent, I think it would be helpful to have some kind of design to point to.

@ahg-g (Member) commented Jun 2, 2020

+1

I like the idea of associating the logs with an identifier. We should look at structured logging and see if what you are planning to do aligns with that proposal: kubernetes/enhancements#1602

@alculquicondor (Member)

+1 on structured logging.

@ingvagabund (Contributor)

+1
I wish we could have two logging streams, one of them dedicated to just scheduling decisions.

@Huang-Wei (Member)

+1.

In addition to structured logging (to get consistent identifiers), we should have a consistent way to define each log level's semantics. No need for a KEP, but it would be nice to document this in https://github.com/kubernetes/community/tree/master/contributors/devel/sig-scheduling.
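For example, documented semantics could look something like this (purely hypothetical level assignments, just to show the idea):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// Hypothetical level convention, only to illustrate documented semantics;
// these are not the scheduler's actual levels.
func logCycleOutcome(pod *corev1.Pod, nodeName string, score int64) {
	klog.V(2).InfoS("Pod scheduled", "pod", klog.KObj(pod), "node", nodeName)              // per-cycle outcome
	klog.V(5).InfoS("Node score", "pod", klog.KObj(pod), "node", nodeName, "score", score) // per-plugin/per-node detail
}
```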

@jesusha123 (Contributor)

/assign

@yuzhiquan (Member)

/cc
/cc @serathius

@serathius (Contributor)

It would definitely be good to have cooperation between SIG Instrumentation and SIG Scheduling to collect requirements for structured logging.
Defining log semantics (aka a log schema) is definitely on our plate (it was part of the original proposal but was postponed to focus on the migration).
As for separating log streams for scheduling decisions, it would be interesting to make more use of https://github.com/go-logr/logr (the generic logging interface we use underneath klog) features such as named loggers.
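For illustration, a named logger in plain logr could look like this (using the stdr reference implementation just to keep the sketch runnable; the logger name "decisions" is made up):

```go
package main

import (
	"log"
	"os"

	"github.com/go-logr/stdr"
)

func main() {
	base := stdr.New(log.New(os.Stdout, "", log.LstdFlags))

	// A named child logger: its output is tagged with "decisions", so a
	// dedicated stream could be split off by filtering on the logger name.
	decisions := base.WithName("decisions")
	decisions.Info("Selected node", "pod", "default/nginx", "node", "node-1")
}
```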

Adding logr authors to discussion.
/cc @thockin @DirectXMan12

@soltysh (Contributor) commented Jun 3, 2020

/assign

@jherrera123 I'd suggest syncing with @damemi on this, since, as far as I can tell, he opened this issue intending to work on it himself.

@damemi (Contributor, Author) commented Jun 3, 2020

Thanks for the discussion so far everyone. I wasn't totally aware of the structured logging KEP, but I think this does align well with it. I see that there are even sections of the scheduler that have already been identified as good spots to add structured logging.

Like @Huang-Wei said, in addition to logging identifiers it's important that we document consistent log levels so users can know exactly what they're getting at each level. I think this goes beyond the existing structured logging KEP (and is scheduler-specific), so this issue isn't quite a duplicate of the KEP.

And @jherrera123, I haven't started work on this yet (still gathering feedback here), but I won't turn down any offers to help :)

@jesusha123 (Contributor)

@damemi once we land on a design, I'd be happy to help. It looks like there's a lot to be done in the logging area.

@damemi (Contributor, Author) commented Aug 17, 2020

Sorry all, I haven't forgotten about this. I've read up some more on structured logging, and it doesn't look too bad to implement. I think the big thing (at least for the scheduler) would be something like the named loggers @serathius mentioned in #91633 (comment). Though I think we could get sufficient context with just a pod key in the structured logs, at least for now.
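For illustration, a minimal sketch of what "just a pod key in the structured logs" could look like (assumed names, not actual scheduler code):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// klog.KObj renders the pod as "namespace/name", so every line about this
// pod can be collected with, e.g., grep 'pod="default/nginx"'.
func logAttempt(pod *corev1.Pod) {
	klog.V(3).InfoS("Attempting to schedule pod", "pod", klog.KObj(pod))
	klog.V(3).InfoS("Unable to schedule pod, waiting", "pod", klog.KObj(pod))
}
```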

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Nov 15, 2020
@damemi (Contributor, Author) commented Nov 16, 2020

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Nov 16, 2020
@alculquicondor (Member) commented Nov 16, 2020

@damemi could you make a list of what's needed to have kube-scheduler in a desirable state with regard to structured logging?

@damemi (Contributor, Author) commented Nov 16, 2020

> @damemi could you make a list of what's needed to have kube-scheduler in a desirable state with regard to structured logging?

Yeah, I'll work on putting together what we talked about here. The main improvement would be logging a unique key for each pod throughout its scheduling cycle, as mentioned above.
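One way to thread such a key through the whole cycle is a sketch like the following, using klog/v2's contextual-logging helpers (the function names here are invented):

```go
package main

import (
	"context"

	"k8s.io/klog/v2"
)

func scheduleOne(ctx context.Context) {
	// Attach the pod key once; every callee that logs via
	// klog.FromContext(ctx) inherits it automatically.
	logger := klog.FromContext(ctx).WithValues("pod", klog.KRef("default", "nginx"))
	ctx = klog.NewContext(ctx, logger)

	runFilterPlugins(ctx)
}

func runFilterPlugins(ctx context.Context) {
	// No pod argument needed just for logging; the key rides along in ctx.
	klog.FromContext(ctx).V(4).Info("Running filter plugins")
}

func main() {
	scheduleOne(context.Background())
}
```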

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 14, 2021
@gavinfish (Contributor)

/remove-lifecycle stale

@mengjiao-liu (Member) commented Mar 14, 2023

Ref #111155 (comment)

> Everything under pkg/scheduler/apis can be its own PR.
>
> The rest is a bit unclear. Maybe it's possible to have a PR for each of the folders in pkg/scheduler/framework/plugins. Followed by a PR that does the rest.

If we start this task, we need to figure out how to split the PRs. I'm not very familiar with the scheduler's directory structure, so I'll first split the PRs along the lines @alculquicondor suggested and then adjust based on the actual implementation.

Some packages' context updates depend on other packages, so the ones without dependencies will be done first.

✔️ means the PR has been merged and that item is complete.

| Component | PR | Status |
| --- | --- | --- |
| pkg/scheduler/apis/ | Updates needed for v1 were added in #115588; the v1beta2 and v1beta3 packages are already marked as deprecated, so no changes are needed there (#111155 (comment)) | ✔️ |
| pkg/scheduler/framework/plugins/defaultbinder/ | #116571 | ✔️ |
| pkg/scheduler/framework/plugins/interpodaffinity/ | #116635 | ✔️ |
| pkg/scheduler/framework/plugins/noderesources/ | #116748 | ✔️ |
| pkg/scheduler/framework/plugins/podtopologyspread/ | #116797 | ✔️ |
| pkg/scheduler/framework/plugins/volumezone/ | #116829 | ✔️ |
| pkg/scheduler/framework/preemption/, pkg/scheduler/framework/plugins/defaultpreemption/ | #116835 | ✔️ |
| pkg/scheduler/framework/runtime | #116842 | ✔️ |
| pkg/scheduler/framework/plugins/examples/stateful/ | #116885 | ✔️ |
| pkg/scheduler/framework/plugins/nodevolumelimits/ | #116884 | ✔️ |
| pkg/scheduler/framework/plugins/volumebinding/ | #116803 | ✔️ |
| cmd/kube-scheduler/app/server.go, pkg/scheduler/eventhandlers.go & eventhandlers_test.go, pkg/scheduler/internal, pkg/scheduler/schedule_one.go & scheduler_test.go & schedule_one_test.go, test/integration/daemonset/, other files... | #116849 | ✔️ |
| hack/logcheck.conf (until all are migrated) | #120933 | WIP |
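For reference, the migration pattern these PRs apply looks roughly like this (a simplified sketch, not an exact diff from any of the PRs):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// Before: free-standing klog call, no request context.
func fitsBefore(pod *corev1.Pod, nodeName string) {
	klog.V(5).InfoS("Pod fits on node", "pod", klog.KObj(pod), "node", nodeName)
}

// After: the logger comes from ctx and can already carry the pod and
// cycle key/value pairs that were attached higher up the call chain.
func fitsAfter(ctx context.Context, nodeName string) {
	logger := klog.FromContext(ctx)
	logger.V(5).Info("Pod fits on node", "node", nodeName)
}
```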

@pohly (Contributor) commented Mar 14, 2023

I would prefer to have one owner for this issue. All of the changes exist, it's just a matter of splitting up the large PR into smaller pieces and then staying on top of concurrent changes to the code.

@mengjiao-liu: if you feel that you can handle this, then I'd be happy to have you work on this and we won't need a second assignee.

@mengjiao-liu (Member) commented Mar 14, 2023

> I would prefer to have one owner for this issue. All of the changes exist, it's just a matter of splitting up the large PR into smaller pieces and then staying on top of concurrent changes to the code.
>
> @mengjiao-liu: if you feel that you can handle this, then I'd be happy to have you work on this and we won't need a second assignee.

Okay, I can handle this. I'll also rename the second column of the table from "Owner" to "PR".

@pohly (Contributor) commented Mar 17, 2023

/unassign
/assign @mengjiao-liu

@alculquicondor (Member)

what PR should I review first? :)

@mengjiao-liu (Member) commented Apr 12, 2023

> what PR should I review first? :)

@alculquicondor I have marked the PRs that have dependencies; the ones without listed dependencies can be reviewed. You can start with #116842, #116849, or some of the plugin PRs 😀.

To avoid having them reviewed prematurely, the PRs that depend on others are still in Draft status.

@mengjiao-liu (Member)

/sig instrumentation
/wg structured-logging
/area logging

@k8s-ci-robot added the sig/instrumentation and area/logging labels Jul 10, 2023
@k8s-ci-robot (Contributor)

@mengjiao-liu: Closing this issue.

In response to this:

> Work continues in #91633 (comment)
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mengjiao-liu (Member) commented Jul 10, 2023

/reopen

I closed the issue by mistake.

@k8s-ci-robot reopened this Jul 10, 2023
@k8s-ci-robot (Contributor)

@mengjiao-liu: Reopened this issue.

In response to this:

> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
