kubelet: enable configurable rotation duration and parallel rotate #114301
Please note that we're already in Test Freeze for the Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Tue Dec 6 03:52:59 UTC 2022. |
/test pull-kubernetes-node-e2e-containerd |
/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2 |
d155491
to
35a6c70
Compare
/cc @dims |
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project. |
/test pull-kubernetes-node-e2e-containerd |
2f1017f
to
36e9118
Compare
/retest |
@harshanarayana i was thinking that this is a bug in our implementation and not a new feature we are adding. So one way to avoid an explicit feature gate is by leaving the two new parameters to zero and leave the older behavior when these are zero. I'll try to find someone to help with reviews etc. |
03ce326
to
ab8c784
Compare
Changes pushed based on comments from #114301 (comment) |
@liggitt finally (🤞🏾 ) it's ready. please |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dims, harshanarayana, liggitt, mrunalp The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Now that this is approved, @harshanarayana can you open up a k/website PR documenting this? |
@kannon92 definitely. Do you have a preferred section where you want this updated? Also, what do you think would be useful to put into k/website? |
Maybe a new section here: https://kubernetes.io/docs/concepts/cluster-administration/logging/#log-rotation |
allErrors = append(allErrors, fmt.Errorf("invalid configuration: containerLogMaxWorkers must be greater than or equal to 1"))
}

if kc.ContainerLogMonitorInterval.Duration.Seconds() < 3 {
How did you decide on 3 seconds?
@cartermckinnon It was based on this comment #114301 (comment)
In order to perform an efficient log rotation in clusters where the volume of the logs generated by
the workload is large, kubelet also provides a mechanism to tune how the logs are rotated in
terms of how many concurrent log rotations can be performed and the interval at which the logs are
monitored and rotated as required. These attributes can be configured by setting `containerLogMaxWorkers`
and `containerLogMonitorInterval`.
@kannon92 Is something like this under the log rotation section good enough? |
@harshanarayana can you open up a PR on k/website? and we can iterate on the language. I think that is a good start. |
@kannon92 Done. PTAL when you can |
/lgtm |
LGTM label has been added. Git tree hash: dc3c24f6b2ae8143463e3f50dc35f7e0e4c02fe8
/hold cancel |
@harshanarayana: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest |
Is there some reason this isn't marked kind/feature? It adds two new fields and a bunch of new, useful behavior |
What type of PR is this?
/kind bug
What this PR does / why we need it:
The current implementation of `container_log_manager.go` works with the following behavior
This behavior causes problems as the rate at which containers log increases. In #110630, for example, the containers were generating logs at a very high rate of 6M/second.
Due to the time required to iterate over each container and to compress and rotate the files, the log rotation can never really honor the configured values of `container-log-max-files` and `container-log-max-size` on a cluster with this high a log generation rate and a large number of such pods running.
What this PR does to mitigate this is enable two things.
How does this PR enable it?
It adds two additional parameters to the kubelet's configuration:
containerLogMaxWorkers
containerLogMonitorInterval
The first enables parallel log rotation with a configurable number of workers, chosen based on the pod capacity and the log generation rate of the cluster. The second configures how frequently the logs are monitored and rotated.
How does this work?
When the container log manager starts, it creates `ContainerLogMaxWorkers` goroutine-based workers that are responsible for running the log rotation workflow for the pods in the cluster. A loop running at an interval defined by `ContainerLogMonitorPeriod` lists all the containers, finds the running ones, and pushes them into a queue.
This queue is then processed by the workers to rotate the logs.
Currently, there is a wait loop at the tail end that ensures the queue is empty before handing control back for the next monitoring iteration based on `ContainerLogMonitorPeriod`, to avoid the same container being processed by multiple workers at the same time.
Which issue(s) this PR fixes:
Fixes #110630
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Additional Note
I am hoping to get some suggestion on the following
Note to Reviewers
Even with this fix, the log rotation will not be perfect. Because we run the log rotation from outside, via a lookup at a regular interval, this will work more or less the way logrotate-like infrastructure does the job; i.e., the size of the file can always exceed the configured maximum size before the kubelet actually rotates it.
The safest way to rotate a log as close as possible to the limit defined in the config is to set up a notify watch for each log file in question and rotate on notification. But that can prove to be a really costly operation when the number of containers is really large. Also, tearing down the notify watch when the pod is deleted or terminated can be an additional overhead to deal with. But I am happy to explore that further as well if someone thinks it is the better way to rotate and manage the logs.
As you can see, the logs are still not getting truncated fast enough, but much better than they did before.