
kubelet: enable configurable rotation duration and parallel rotate #114301

Merged

Conversation

harshanarayana
Contributor

@harshanarayana commented Dec 6, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

The current implementation of container_log_manager.go behaves as follows:

  1. It monitors the logs every 10 seconds (not configurable).
  2. It rotates the logs in a single loop, one container after the other (see the sketch below).
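
A minimal sketch of that serial flow, with illustrative names rather than the actual container_log_manager.go code:

```go
// Hypothetical sketch of the pre-PR behavior: a fixed 10-second poll that rotates
// each container's logs one after the other. Names are illustrative only.
package logrotate

import "time"

type serialManager struct {
    listRunningContainers func() []string // returns IDs of running containers
    rotate                func(id string) // compress and rotate one container's logs
}

func (m *serialManager) start(stop <-chan struct{}) {
    ticker := time.NewTicker(10 * time.Second) // monitor period is hard-coded
    defer ticker.Stop()
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            // Single loop: containers are rotated one after the other.
            for _, id := range m.listRunningContainers() {
                m.rotate(id)
            }
        }
    }
}
```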

This behavior causes problems as the rate at which containers log increases. As reported in #110630, for example, the containers were generating logs at a very high rate of 6M/second.

Because of the time required to iterate over each container and to compress and rotate its files, log rotation can never really honor the configured values of container-log-max-files and container-log-max-size on a cluster with such a high log generation rate and a large number of such pods running.

To mitigate this, the PR enables two things:
  1. Parallel log rotation based on a work queue.
  2. A configurable interval for monitoring the logs for rotation, so the monitoring period can be tuned to the cluster's log generation rate.
How does this PR enable it?

It adds two additional parameters to the kubelet's configuration:

  1. containerLogMaxWorkers
  2. containerLogMonitorInterval

The first enables parallel log rotation with a configurable number of workers, sized according to the cluster's pod capacity and log generation rate; the second configures how frequently the logs are monitored and rotated.
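
As a rough sketch of what those parameters look like on the KubeletConfiguration type (the field names come from this PR description; the exact types and JSON tags here are assumptions modeled on existing kubelet configuration conventions):

```go
// Trimmed, hypothetical view of the KubeletConfiguration additions; only the two
// new fields are shown, and the types/tags are assumptions.
package config

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type KubeletConfiguration struct {
    // ... existing fields such as containerLogMaxSize and containerLogMaxFiles ...

    // ContainerLogMaxWorkers caps how many container log rotations may run in parallel.
    ContainerLogMaxWorkers int32 `json:"containerLogMaxWorkers,omitempty"`

    // ContainerLogMonitorInterval controls how often logs are checked against
    // containerLogMaxSize and rotated when needed.
    ContainerLogMonitorInterval metav1.Duration `json:"containerLogMonitorInterval,omitempty"`
}
```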

How does this work?

When the container log manager starts, it creates ContainerLogMaxWorkers goroutine-based workers that are responsible for the log rotation workflow for the pods in the cluster. A loop running at an interval defined by ContainerLogMonitorPeriod lists all containers, filters for the ones that are running, and pushes them onto a queue.

The workers then process this queue to rotate the logs.

Currently, there is a wait at the tail end to ensure the queue is empty before control is handed back for the next monitoring iteration (based on ContainerLogMonitorPeriod), so that the same container is not processed by multiple workers at the same time.
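
A minimal sketch of that queue-based flow (illustrative names only, not the actual kubelet code):

```go
// Minimal sketch of the worker-pool rotation flow described above; names are
// illustrative, not the actual kubelet container log manager code.
package logrotate

import (
    "sync"
    "time"
)

type workerManager struct {
    maxWorkers            int           // ContainerLogMaxWorkers
    monitorPeriod         time.Duration // ContainerLogMonitorPeriod
    listRunningContainers func() []string
    rotate                func(containerID string)
}

func (m *workerManager) start(stop <-chan struct{}) {
    queue := make(chan string, m.maxWorkers)
    var wg sync.WaitGroup

    // Spin up ContainerLogMaxWorkers goroutine-based workers that rotate logs
    // for the container IDs pushed onto the queue.
    for i := 0; i < m.maxWorkers; i++ {
        go func() {
            for id := range queue {
                m.rotate(id)
                wg.Done()
            }
        }()
    }

    ticker := time.NewTicker(m.monitorPeriod)
    defer ticker.Stop()
    for {
        select {
        case <-stop:
            close(queue)
            return
        case <-ticker.C:
            // List the running containers and push them onto the queue.
            for _, id := range m.listRunningContainers() {
                wg.Add(1)
                queue <- id
            }
            // Wait for the queue to drain before the next monitoring pass, so the
            // same container is never handled by two workers at the same time.
            wg.Wait()
        }
    }
}
```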

Which issue(s) this PR fixes:

Fixes #110630

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Enabled a mechanism for concurrent log rotation via `kubelet`, using the configuration entry `containerLogMaxWorkers`, which controls the maximum number of concurrent rotations that can be performed, and the interval configuration `containerLogMonitorInterval`, which can aid in configuring the monitoring interval to best suit your cluster's log generation rate.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NA

Additional Note

I am hoping to get some suggestions on the following:

  1. I intend to feature gate this so that we can fall back to the old behavior in case things don't work out as intended. Are there guidelines for this? Should there be a KEP first?
  2. This also introduces two new attributes to the kubelet's configuration parameters, i.e. an API change. Should this go through a KEP first?

Note to Reviewers

Even with this fix, log rotation will not be perfect. Because we run the rotation from outside the container and via a regular interval-based lookup, it works more or less the way logrotate-style infrastructure does, i.e. the size of the file can always exceed the configured maximum before the kubelet actually rotates it.

The surest way to rotate a log as close to the configured limit as possible is to set up a notify watch for each log file in question and rotate on write events (sketched below). But that can prove to be a really costly operation when the number of containers is very large, and tearing down the watch when the pod is deleted or terminated is additional overhead to deal with. I am happy to explore that further if someone thinks it is the better way to rotate and manage the logs.
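
For reference, a rough sketch of that per-file watch idea, assuming the third-party fsnotify library; the rotation step itself is just a placeholder:

```go
// Hypothetical per-log-file watcher using github.com/fsnotify/fsnotify (an assumption,
// not what the kubelet does). One watch per log file is what makes this approach
// expensive at high container counts.
package logwatch

import (
    "log"
    "os"

    "github.com/fsnotify/fsnotify"
)

func watchAndRotate(logPath string, maxSize int64, stop <-chan struct{}) error {
    watcher, err := fsnotify.NewWatcher()
    if err != nil {
        return err
    }
    defer watcher.Close()

    if err := watcher.Add(logPath); err != nil {
        return err
    }

    for {
        select {
        case <-stop: // e.g. the pod was deleted or terminated
            return nil
        case event := <-watcher.Events:
            if event.Op&fsnotify.Write == fsnotify.Write {
                if info, err := os.Stat(logPath); err == nil && info.Size() >= maxSize {
                    log.Printf("rotating %s (size %d bytes)", logPath, info.Size())
                    // compress the previous file and rotate here
                }
            }
        case err := <-watcher.Errors:
            log.Printf("watch error for %s: %v", logPath, err)
        }
    }
}
```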

root@log-rotate-control-plane:/var/log/pods#  ls -ltrh default_logging-test-0_dd27f6ef-7479-401e-93bc-1aa9c052b92d/**/
total 41M
-rw-r--r-- 1 root root 103K Dec 15 15:00 0.log.20221215-150017.gz
-rw-r--r-- 1 root root 103K Dec 15 15:00 0.log.20221215-150018.gz
-rw-r----- 1 root root  14M Dec 15 15:00 0.log.20221215-150020
-rw-r--r-- 1 root root 103K Dec 15 15:00 0.log.20221215-150019.gz
-rw-r----- 1 root root    0 Dec 15 15:00 0.log
-rw-r--r-- 1 root root 103K Dec 15 15:01 1.log.20221215-150124.gz
-rw-r--r-- 1 root root 103K Dec 15 15:01 1.log.20221215-150125.gz
-rw-r--r-- 1 root root 103K Dec 15 15:01 1.log.20221215-150126.gz
-rw-r----- 1 root root  19M Dec 15 15:01 1.log.20221215-150128
-rw-r----- 1 root root 8.1M Dec 15 15:01 1.log
root@log-rotate-control-plane:/var/log/pods# cat /var/lib/kubelet/config.yaml | grep -i containerLog
containerLogMaxWorkers: 10
containerLogMonitorPeriod: 5ms
containerLogMaxSize: 5Mi
default_logging-test-0_dcaea560-3715-4010-8dd7-0c6141a3feb8/logging-test:
total 14M
-rw-r--r-- 1 root root 103K Dec 15 15:05 0.log.20221215-150529.gz
-rw-r--r-- 1 root root 103K Dec 15 15:05 0.log.20221215-150530.gz
-rw-r----- 1 root root  14M Dec 15 15:05 0.log.20221215-150532
-rw-r--r-- 1 root root 103K Dec 15 15:05 0.log.20221215-150531.gz
-rw-r----- 1 root root    0 Dec 15 15:05 0.log
---- truncated....

As you can see, the logs are still not getting truncated fast enough, but this is much better than before.

@k8s-ci-robot added release-note, size/L, and kind/bug labels Dec 6, 2022
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.26 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.26.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Tue Dec 6 03:52:59 UTC 2022.

@k8s-ci-robot added cncf-cla: yes and area/code-generation labels Dec 6, 2022
@k8s-ci-robot added area/kubelet, do-not-merge/needs-sig, kind/api-change, needs-triage, sig/api-machinery, sig/node, and needs-priority labels and removed do-not-merge/needs-sig label Dec 6, 2022
@harshanarayana
Contributor Author

/test pull-kubernetes-node-e2e-containerd

@harshanarayana
Contributor Author

/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2

@harshanarayana force-pushed the kubelet/log-rotate-improvements branch 2 times, most recently from d155491 to 35a6c70, December 6, 2022 08:35
@harshanarayana
Contributor Author

/cc @dims

@k8s-triage-robot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@harshanarayana
Contributor Author

/test pull-kubernetes-node-e2e-containerd

@harshanarayana force-pushed the kubelet/log-rotate-improvements branch 2 times, most recently from 2f1017f to 36e9118, December 6, 2022 11:01
@dims
Member

dims commented Dec 7, 2022

@dims
Member

dims commented Dec 7, 2022

/retest

@dims
Member

dims commented Dec 7, 2022

@harshanarayana I was thinking that this is a bug in our implementation and not a new feature we are adding. So one way to avoid an explicit feature gate is to default the two new parameters to zero and keep the older behavior when they are zero. I'll try to find someone to help with reviews etc.
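
A small sketch of that fallback idea, with hypothetical names and the old defaults assumed (one worker, matching the previous serial rotation, and the previous 10-second monitor period):

```go
// Hypothetical sketch of the suggested fallback: when the new fields are left at
// zero, keep the previous behavior. Field names and defaults are assumptions.
package defaults

import "time"

type logRotationConfig struct {
    ContainerLogMaxWorkers      int32
    ContainerLogMonitorInterval time.Duration
}

func applyLegacyDefaults(cfg *logRotationConfig) {
    if cfg.ContainerLogMaxWorkers == 0 {
        cfg.ContainerLogMaxWorkers = 1 // serial rotation, as before
    }
    if cfg.ContainerLogMonitorInterval == 0 {
        cfg.ContainerLogMonitorInterval = 10 * time.Second // previous hard-coded interval
    }
}
```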

@harshanarayana
Contributor Author

Changes pushed based on comments from #114301 (comment)

@dims
Member

dims commented Feb 10, 2024

@liggitt finally (🤞🏾 ) it's ready. please /approve if appropriate.

@liggitt
Member

liggitt commented Feb 12, 2024

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, harshanarayana, liggitt, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Feb 12, 2024
@kannon92
Contributor

#114301 (comment)

Now that this is approved, @harshanarayana can you open up a k/website PR documenting this?

@harshanarayana
Contributor Author

#114301 (comment)

Now that this is approved, @harshanarayana can you open up a k/website PR documenting this?

@kannon92 definitely. Do you have a preferred section where you want this updated? Also, what all do you think would be useful to put into the k/website?

@kannon92
Contributor

#114301 (comment)
Now that this is approved, @harshanarayana can you open up a k/website PR documenting this?

@kannon92 definitely. Do you have a preferred section where you want this updated? Also, what all do you think would be useful to put into the k/website?

Maybe a new section here: https://kubernetes.io/docs/concepts/cluster-administration/logging/#log-rotation

allErrors = append(allErrors, fmt.Errorf("invalid configuration: containerLogMaxWorkers must be greater than or equal to 1"))
}

if kc.ContainerLogMonitorInterval.Duration.Seconds() < 3 {
Contributor

How did you decide on 3 seconds?

Contributor Author

@cartermckinnon It was based on this comment #114301 (comment)

@harshanarayana
Contributor Author

In order to perform an efficient log rotation in clusters where the volume of the logs generated by
the workload is large, kubelet also provides a mechanism to tune how the logs are rotated in
terms of how many concurrent log rotations can be performed and the interval at which the logs are
monitored and rotated as required. These attributes can be configured by setting `containerLogMaxWorkers`
and `containerLogMonitorInterval`.

@kannon92 Is something like this under the log rotation section good enough?

@kannon92
Contributor

@harshanarayana can you open up a PR on k/website? We can iterate on the language there. I think that is a good start.

@harshanarayana
Contributor Author

kubernetes/website#45105

@kannon92 Done. PTAL when you can

@kannon92
Contributor

/lgtm

@k8s-ci-robot added the lgtm label Feb 14, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: dc3c24f6b2ae8143463e3f50dc35f7e0e4c02fe8

@kannon92
Contributor

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label Feb 14, 2024
@k8s-ci-robot
Contributor

k8s-ci-robot commented Feb 14, 2024

@harshanarayana: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
Commit: 36e9118fbe5e03b281e3f505f5ad1a58b912ee2f
Required: false
Rerun command: /test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

  • The PR does not have any do-not-merge/* labels
  • The PR does not have the needs-ok-to-test label
  • The PR is mergeable (does not have a needs-rebase label)
  • The PR is approved (has cncf-cla: yes, lgtm, approved labels)
  • The PR is failing tests required for merge

You can:

/retest

@k8s-ci-robot merged commit 5b2c919 into kubernetes:master Feb 14, 2024
15 checks passed
SIG Node PR Triage automation moved this from Needs Reviewer to Done Feb 14, 2024
@k8s-ci-robot added this to the v1.30 milestone Feb 14, 2024
@jberkus

jberkus commented Feb 22, 2024

Is there some reason this isn't marked kind/feature? It adds two new fields and a bunch of new, useful behavior.

Projects
Status: API review completed, 1.30
Archived in project
Development

Successfully merging this pull request may close these issues.

Kubelet does not respect container-log-max-size on time, during heavy log writes from container