Do contextual logging for scheduler #91633
/kind cleanup
cc @ahg-g @Huang-Wei @alculquicondor @ingvagabund @ravisantoshgudimetla wdyt? To make sure our logging stays consistent, I think it would be helpful to have some kind of design to point to.
+1 I like the idea of associating the logs with an identifier. We should look at structured logging and see if what you are planning to do aligns with that proposal: kubernetes/enhancements#1602
+1 on structured logging.
+1
+1. In addition to structured logging (to have consistent identifiers), we should have consistent mechanics to define each log level's semantics. No need for a KEP, but it would be nice to keep it documented in https://github.com/kubernetes/community/tree/master/contributors/devel/sig-scheduling.
/assign |
/cc |
Definitely, it would be good to have cooperation between SIG Instrumentation and SIG Scheduling to collect requirements for structured logging. Adding the logr authors to the discussion.
@jherrera123 I'd suggest syncing with @damemi on this, since he initiated this bug to start working on it, as far as I can tell. |
Thanks for the discussion so far, everyone. I wasn't totally aware of the structured logging KEP, but I think this does align well with it. I see that there are even sections of the scheduler that have already been identified as good spots to add structured logging. Like @Huang-Wei said, in addition to logging identifiers, it's important that we document consistent log levels so users can know exactly what they're getting at each level. I think this goes beyond the existing structured logging KEP (and is scheduler-specific), so this issue isn't quite a duplicate of the KEP. Also, @jherrera123: I haven't started work on this yet (still getting feedback here), but I won't turn down any offers to help :)
@damemi once we land on a design, I'd be happy to help. It looks like there's a lot to be done in the logging area.
Sorry all, I haven't forgotten about this. I read up some more on structured logging and it doesn't look too bad to implement. I think the big thing (at least for the scheduler) would be something like the named loggers @serathius mentioned in #91633 (comment). Though I think we could get sufficient context with just a pod key in the structured logs, at least for now.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale |
@damemi could you make a list of what's needed to have kube-scheduler in a desirable state with regard to structured logging? |
Yeah, I'll work on putting together what we talked about here. The main improvement would be logging a unique key for each pod through its scheduling cycle like mentioned above |
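To make the pod-key idea concrete, here is a minimal stand-alone sketch of what a unique key threaded through a scheduling cycle could look like. It uses Go's stdlib `log/slog` as a stand-in for klog's structured `InfoS` calls; `podKey`, `schedulePod`, and all the field names and values are illustrative, not the actual scheduler implementation:

```go
package main

import (
	"log/slog"
	"os"
)

// podKey builds the identifier stamped on every log line of a cycle,
// e.g. "default/nginx". Hypothetical helper, not real scheduler code.
func podKey(namespace, name string) string {
	return namespace + "/" + name
}

// schedulePod sketches one scheduling cycle. Deriving a child logger
// with logger.With means every record it emits carries the same "pod"
// key automatically.
func schedulePod(logger *slog.Logger, namespace, name string) {
	log := logger.With("pod", podKey(namespace, name))

	log.Info("Attempting to schedule pod")
	log.Info("Ran filter plugins", "feasibleNodes", 3)
	log.Info("Selected node", "node", "node-1", "score", 87)
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stdout, nil))
	schedulePod(logger, "default", "nginx")
}
```

Because every record in the cycle carries the same `pod` key, something like `grep 'pod=default/nginx'` over the scheduler log would collect the whole cycle in one pass.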
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale |
Before we start the task, we need to figure out how to split up the PRs. Some package context updates depend on other packages, so the ones with no dependencies should go first. ✔️ means that PR has been merged and the task is complete.
I would prefer to have one owner for this issue. All of the changes exist; it's just a matter of splitting up the large PR into smaller pieces and then staying on top of concurrent changes to the code. @mengjiao-liu: if you feel that you can handle this, then I'd be happy to have you work on this and we won't need a second assignee.
Okay, I can handle it. I'll also update the table, renaming its second column from "Owner" to "PR".
/unassign |
Which PR should I review first? :)
@alculquicondor I have marked the PRs that have dependencies. The PRs without dependencies can be reviewed now; you can start with #116842, #116849, or some of the plugin PRs 😀. To avoid them being reviewed by mistake, the PRs that still depend on others remain in …
/sig instrumentation |
@mengjiao-liu: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
Closed the issue by mistake.
@mengjiao-liu: Reopened this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The kube-scheduler is currently difficult to debug through logs. Logging is inconsistent and sparse, and critically helpful information is in some spots logged at too high of a level to be practically accessible (such as node scores, which are logged at `V(10)`; at that level, logs are flooded with low-level information too quickly to even grep).

What would you like to be added:
Standardize scheduler logging to useful, documented levels to improve debuggability and diagnosis of scheduling anomalies. This information should be accessible at reasonable levels (i.e., `v=2,3,4...`) so it is not drowned out by irrelevant low-level process and network logs, and it should be consistent across the scheduler. Specifically, the information that is necessary includes, but is not limited to:
(sub-lists indicate potential logging hierarchies; these are just some examples)
In addition, these logs should be easy to trace through a scheduling cycle. That is, all logs related to a specific scheduling cycle should carry a unique identifier so that relevant information can be gathered efficiently with common search tools like grep.
Why is this needed:
The most common issues with the scheduler are questions related to the scheduling cycle ("Why did/didn't my pod get scheduled here/there?"). While the scheduler does currently report `FailedScheduling` events, these can lack information (#91340, #91601) and do not provide any insight into a successful scheduling that does not land on the expected node (e.g., scoring interference). There is demand to be able to easily debug the steps of the scheduling cycle, and the lack of that ability frequently confuses users. It would be very helpful to be able to say, for example, "If you set `v=2`, you'll be able to see each plugin's results."

/sig scheduling
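As a rough illustration of the level-gating being asked for, the following stand-alone sketch mimics klog's `V()` guard pattern. The `verbosity` variable, `V`, and `Infof` here are a toy reimplementation for demonstration, not the real klog API or scheduler code:

```go
package main

import "fmt"

// verbosity mimics the -v flag: a message guarded by V(n) is emitted
// only when n <= verbosity. Toy stand-in for klog's verbosity gate.
var verbosity = 2

type verbose bool

// V returns a guard that is enabled when level <= verbosity.
func V(level int) verbose { return verbose(level <= verbosity) }

// Infof prints only when the guard is enabled.
func (v verbose) Infof(format string, args ...any) {
	if v {
		fmt.Printf(format+"\n", args...)
	}
}

func main() {
	// Visible at the default v=2: per-plugin scoring results.
	V(2).Infof("plugin %q scored node %q: %d", "NodeAffinity", "node-1", 87)
	// Hidden unless v>=10: the raw score dump that currently floods logs.
	V(10).Infof("raw scores: %v", []int{87, 12, 55})
}
```

With the threshold at 2, the first line prints and the raw score dump stays silent; raising the threshold to 10 would surface both, matching today's situation where scores only appear at `V(10)`.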