
stdout logs should not be accounted in ephemeral storage usage #124333

Open
P1ng-W1n opened this issue Apr 16, 2024 · 8 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@P1ng-W1n

What happened?

This question was originally reported at https://discuss.kubernetes.io/t/why-stdout-logs-are-accounted-in-ephemeral-storage-usage/27815.
However, the more I think about it, the more I'm convinced that this is an actual issue.
When the kubelet calculates ephemeral storage usage, it includes the usage of the CRI-managed logs in the calculation:
Code snippet from pkg/kubelet/eviction/eviction_manager.go:

func (m *managerImpl) podEphemeralStorageLimitEviction(podStats statsapi.PodStats, pod *v1.Pod) bool {
	podLimits := resourcehelper.PodLimits(pod, resourcehelper.PodResourcesOptions{})
	_, found := podLimits[v1.ResourceEphemeralStorage]
	if !found {
		return false
	}

	// pod stats api summarizes ephemeral storage usage (container, emptyDir, host[etc-hosts, logs])
	podEphemeralStorageTotalUsage := &resource.Quantity{}
	if podStats.EphemeralStorage != nil && podStats.EphemeralStorage.UsedBytes != nil {
		podEphemeralStorageTotalUsage = resource.NewQuantity(int64(*podStats.EphemeralStorage.UsedBytes), resource.BinarySI)
	}
	podEphemeralStorageLimit := podLimits[v1.ResourceEphemeralStorage]
	if podEphemeralStorageTotalUsage.Cmp(podEphemeralStorageLimit) > 0 {
		// the total usage of pod exceeds the total size limit of containers, evict the pod
		message := fmt.Sprintf(podEphemeralStorageMessageFmt, podEphemeralStorageLimit.String())
		if m.evictPod(pod, 0, message, nil, nil) {
			metrics.Evictions.WithLabelValues(signalEphemeralPodFsLimit).Inc()
			return true
		}
		return false
	}
	return false
}

However, those log messages are completely outside the control of the container itself: it has no influence over their retention or lifecycle once they leave the container. Instead, it is the kubelet's responsibility to decide how long to retain those logs, based on the --container-log-max-files and --container-log-max-size settings.

Because of this, if container-log-max-size * container-log-max-files is larger than the container's ephemeral-storage limit, the container can be evicted over its lifetime without writing a single byte to any of its mounted volumes.
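To put numbers on that: with the kubelet defaults quoted further down (container-log-max-files=5, container-log-max-size=10Mi), the retained logs can grow to roughly 5 × 10 MiB = 50 MiB per container, so any container with an ephemeral-storage limit below that can, in the worst case, be evicted purely because of its stdout/stderr logs.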

What did you expect to happen?

The pod / container should not be evicted without reaching the container-defined limit on the ephemeral storage that is actually under the container's control.

How can we reproduce it (as minimally and precisely as possible)?

PoC example:

---
apiVersion: v1
kind: Pod
metadata:
  name: log-test
spec:
  containers:
  - image: k8s.gcr.io/busybox:latest
    name: test
    command: ["/bin/sh"]
    args:
    - -c
    - |-
      yes $(printf 'Hello world!!!!\n%.0s' `seq 1 64`) | dd bs=1024 count=204800
      while sleep 3600; do
        true
      done
    resources:
      limits:
        ephemeral-storage: "10Mi"
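
To see why this PoC trips the limit: the dd command copies roughly 200 MiB (1024 bytes × 204800 blocks) from yes to stdout, all of which ends up in the container's log file under /var/log/pods. Whatever the kubelet retains there (up to ~50 MiB with the default rotation settings) is counted against the pod's 10Mi ephemeral-storage limit, so the pod is evicted even though it never wrote a byte to a mounted volume.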

Anything else we need to know?

No response

Kubernetes version

$ kubectl version --output=json
{
  "clientVersion": {
    "major": "1",
    "minor": "27",
    "gitVersion": "v1.27.6",
    "gitCommit": "741c8db18a52787d734cbe4795f0b4ad860906d6",
    "gitTreeState": "clean",
    "buildDate": "2023-09-13T09:21:34Z",
    "goVersion": "go1.20.8",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "kustomizeVersion": "v5.0.1",
  "serverVersion": {
    "major": "1",
    "minor": "28",
    "gitVersion": "v1.28.4",
    "gitCommit": "bae2c62678db2b5053817bc97181fcc2e8388103",
    "gitTreeState": "clean",
    "buildDate": "2023-11-15T16:48:54Z",
    "goVersion": "go1.20.11",
    "compiler": "gc",
    "platform": "linux/amd64"
  }
}

Cloud provider

N/A

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@P1ng-W1n P1ng-W1n added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 16, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@P1ng-W1n P1ng-W1n changed the title stdout logs not be accounted in ephemeral storage usage stdout logs should not be accounted in ephemeral storage usage Apr 16, 2024
@neolit123
Member

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 16, 2024
@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Apr 16, 2024
@AnishShah
Contributor

/cc @haircommander

@haircommander
Contributor

However, those log messages are completely outside the control of the container itself: it has no influence over their retention or lifecycle once they leave the container.

I personally disagree with this, as it's the container that is logging in the first place. I understand the lifecycle point about the kubelet being the entity that chooses when to clean up logs, but allowing a container to define its own limits functionally opens the node up to DoS attacks from a container spamming the node and defining its own arbitrarily high limit. The kubelet is responsible for protecting the other workloads on the node, and if the container is being limited by the admin-defined limits, it should log less, IMO.

@P1ng-W1n
Author

Hi @haircommander

allowing a container to define its own limits functionally opens the node up to DoS attacks from a container spamming the node and defining its own arbitrarily high limit.

I'm not sure that we're in sync regarding the context: I'm not asking for a feature to let the container define its own log retention. On the contrary, I think the current design (where the kubelet / CRI handles the lifecycle of container-produced logs) is sufficient to cover that.
And because the kubelet already provides configuration options to define the maximum amount of logs each container can consume on the worker node, I don't see a way to attack the host by producing a large volume of logs.

By default, the configuration allows 50 MiB of (uncompressed) logs to be stored per container (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#options):

--container-log-max-files int32     Default: 5
	<Warning: Beta feature> Set the maximum number of container log files that can be present for a container. The number must be >= 2. (DEPRECATED: This parameter should be set via the config file specified by the kubelet's --config flag. See kubelet-config-file for more information.)
--container-log-max-size string     Default: 10Mi
	<Warning: Beta feature> Set the maximum size (e.g. 10Mi) of container log file before it is rotated. (DEPRECATED: This parameter should be set via the config file specified by the kubelet's --config flag. See kubelet-config-file for more information.)
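
For completeness, the same settings can be expressed in the kubelet config file. A minimal sketch (field names come from the KubeletConfiguration type; the values shown are just the defaults quoted above):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Rotate a container's log file once it reaches this size.
containerLogMaxSize: "10Mi"
# Keep at most this many log files per container.
containerLogMaxFiles: 5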

Needless to say, cluster admins might choose to configure these values differently (allowing more or fewer logs to be stored on a worker node), so the maximum amount of logs allowed on each worker node is completely up to the cluster admin to decide.

No, what I'm highlighting in this issue is rather the opposite: if the container's ephemeral-storage limit is stricter than the amount of logs the kubelet maintains on the worker node, the container can be evicted purely because of the way ephemeral storage usage is calculated.

One might argue that containers should not produce that much log output to begin with, but that is just something that can't be guaranteed over the lifetime of a container:

  • Even if the container produces only a fraction of the allowed logs, over a long period of time those small amounts of logs do add up
  • Some containers are just "chatty" by default (even on INFO loglevel)
  • Other containers might need to run in debug/trace mode, where they can produce a huge volume of logs over a relatively short time

Examples:

  • ingress-nginx pods are chatty by design, and they require frequent rotation as a result (for that container, the 50 MiB default is only enough to hold a few hours' worth of logs):
master-2:~ # crictl ps |grep ingress
49692998d8d53       5aa0bf4798fa2       2 months ago        Running             controller                                   0                   73af5d1f72713       ingress-controller-nginx-ingress-controller-q9zbb
master-2:~ # crictl inspect 49692998d8d53 |grep logPath
	"logPath": "/var/log/pods/ingress-nginx_ingress-controller-nginx-ingress-controller-q9zbb_9d51e25e-22f2-441b-82d2-e4dc6256a70b/controller/0.log",
master-2:~ # ls -l /var/log/pods/ingress-nginx_ingress-controller-nginx-ingress-controller-q9zbb_9d51e25e-22f2-441b-82d2-e4dc6256a70b/controller/
total 18840
-rw-r----- 1 root root  4710921 Apr 25 22:32 0.log
-rw-r--r-- 1 root root  1373597 Apr 25 20:50 0.log.20240425-202651.gz
-rw-r--r-- 1 root root  1372657 Apr 25 21:28 0.log.20240425-205044.gz
-rw-r--r-- 1 root root  1252749 Apr 25 22:12 0.log.20240425-212809.gz
-rw-r----- 1 root root 10560742 Apr 25 22:12 0.log.20240425-221224

Note: the default ingress-nginx / coredns manifests don't define any ephemeral-storage limit, which is why they're not impacted by this issue (see the early return on the `_, found := podLimits[v1.ResourceEphemeralStorage]` check in podEphemeralStorageLimitEviction()).

Unfortunately the documentation doesn't do a good job of explaining how 'ephemeral-storage' limits are used either.
My original interpretation was that it limits the maximum ephemeral storage a container could use from the ephemeral volume (let's say a 500 MiB limit from a 2 GiB sized volume), but that turned out to be incorrect.
Instead, the ephemeral-storage usage is calculated from components (such as the logs retained by the kubelet) that the container has no way of knowing about, and which therefore can't be generalized in the manifest either.

Because of this, it's probably safer to follow the example of ingress-nginx / coredns and not set any ephemeral-storage limit at all to avoid "mysterious" pod evictions, but doesn't that make this feature pointless?
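
For illustration, a minimal sketch of that workaround (pod name and values are just placeholders): keep an ephemeral-storage request for scheduling purposes, but omit the limit, so the early return on the missing limit in podEphemeralStorageLimitEviction() applies and no limit-based eviction can happen:

apiVersion: v1
kind: Pod
metadata:
  name: log-test-no-limit
spec:
  containers:
  - image: k8s.gcr.io/busybox:latest
    name: test
    command: ["/bin/sh", "-c", "while sleep 3600; do true; done"]
    resources:
      requests:
        ephemeral-storage: "10Mi"   # still reserved for scheduling
      # limits:
      #   ephemeral-storage: ...    # intentionally omitted -> no eviction on log growth

Of course this trades away the protection the limit was meant to provide, which is exactly the trade-off being questioned here.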

@AnishShah
Contributor

Marking this as a feature since the cluster-admin will have to account for storage used by container logs separately while planning.

/kind feature
/remove-kind bug

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 1, 2024
@AnishShah AnishShah moved this from Triage to Triaged in SIG Node Bugs May 1, 2024
@P1ng-W1n
Author

P1ng-W1n commented May 1, 2024

Marking this as a feature since the cluster-admin will have to account for storage used by container logs separately while planning.

I would argue that this is already the case: when the cluster admin designs and dimensions a cluster node, they already have to account for the number of pods the node will have to support and for the maximum amount of logs those pods (and their containers) could generate, and thus configure --container-log-max-files and --container-log-max-size accordingly, depending on the requirements.

Again - the issue I'm trying to highlight here is the disconnect between the worker node [1] and the container [2] configuration.

[1] kubelet - controls the amount of logs a container can have on a given worker node
[2] manifest - defines the container requirements according to 'Container v1 core', including the resource requests and limits

Configuration settings for these two are completely disconnected from each other; however, the pod eviction logic connects them indirectly (by accounting for logs that are maintained by the kubelet on the worker node).

Because of this, I don't see the kind/feature label as justified (it feels more like a design bug).

@haircommander
Contributor

I brought this up at the SIG Node meeting today, and @SergeyKanzhelev made a good point. The design of charging logs against the pod's ephemeral storage may feel like an anti-pattern, but that's because it's tricky to fully charge the pod for all of its behavior. We discussed pushing towards a future where the cost of pods on the infrastructure (kubelet, CRI) is not linear in the number of pods. That's an ambitious goal because both do a lot on a per-pod basis, but a good reach goal would be for the infrastructure not to be affected by an increasing number of pods, and instead have each pod charged for all of its behavior. The way ephemeral storage accounts for logs serves that goal.
