
Problematic dimension handling of network metrics for host network pods/interfaces #2615

brancz opened this issue Jul 10, 2020 · 7 comments

brancz (Contributor) commented Jul 10, 2020

Prometheus convention is that any sum over a metric's dimensions should make sense. Metrics exposed by cAdvisor violate this in multiple places, but here I would like to focus on one specific case: Kubernetes Pods with hostNetwork: true.

The particularly confusing result is that a sum of all containers' network traffic counts the node's traffic multiple times, once for each container/pod with host networking enabled.

I don't know cAdvisor's relationship with Kubernetes well enough to say whether cAdvisor even knows about this, but a potential solution could be to exclude metrics for containers that use host networking, and expose one separate set of series just for host networking (or leave that up to an entirely separate component like node_exporter).
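To make the over-counting concrete, consider an assumed, typical cluster-total query. Each host-network Pod re-exposes the node's interface counters, so node traffic is summed once per such Pod:

```promql
# Naive cluster-wide receive rate: every host-network Pod carries a copy
# of the node's interface counters, so those bytes are counted repeatedly.
sum(rate(container_network_receive_bytes_total[5m]))
```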

dashpole (Collaborator) commented
Completely agree that the current behavior is problematic. We might be able to do this by inspecting the container definition for the different container runtimes. E.g. for Docker, we can tell whether it is using the host network from the container config:

```go
networkMode dockercontainer.NetworkMode
```

dashpole (Collaborator) commented
I don't think we can do it generically, so it would be a per-runtime-integration change.

dashpole (Collaborator) commented
The only other consideration is backwards compatibility... This makes sense for Prometheus metrics, but I'm not sure it makes sense for the summary API. I guess we could either re-insert node-level usage for pods that have hostNetwork: true, or, instead of removing those streams, provide a new metric (e.g. container_network_type) that can be joined with the network metrics to filter out host-network series. Preference?
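A sketch of how such a join could look in PromQL, assuming a hypothetical container_network_type info metric with value 1 and a type label (neither exists today; names are illustrative only):

```promql
# Hypothetical: keep only series from Pods with their own network namespace
# by joining against a proposed container_network_type info metric.
sum(
  rate(container_network_receive_bytes_total[5m])
  * on (namespace, pod) group_left ()
  container_network_type{type="pod"}
)
```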

brancz (Contributor, Author) commented Jul 13, 2020

I do agree on backwards compatibility. My preference would be the first option, as the latter would still violate the "sum of all must make sense" rule.

Jean-Daniel commented
Wouldn't it be possible to simply add a new label (container_network_type) instead of a new metric?

It would then at least be easy to filter out the Pods using the host network.
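If such a label existed (purely hypothetical; the label name below is an assumption), filtering would be a plain selector rather than a join:

```promql
# Hypothetical: drop host-network series via a label on the metric itself.
sum(rate(container_network_receive_bytes_total{container_network_type!="host"}[5m]))
```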

weibeld commented Mar 12, 2021

A label to distinguish host network Pods from normal Pods, as proposed by @Jean-Daniel, would already make things easier.

Currently, I distinguish between the two kinds of Pods as follows (Prometheus):

```promql
sum(rate(container_network_receive_bytes_total{interface!="eth0", id="/"}[5m]))
+
sum(rate(container_network_receive_bytes_total{interface="eth0", id!="/"}[5m]))
```

The above calculates the total receive data rate of the cluster. The first line represents the node network interfaces (which are used by the host network Pods). The id="/" selector ensures that the value for a given network interface is counted only once: the time series with id="/" exists exactly once for each node network interface and doesn't correspond to a specific host network Pod.

The second line represents the normal (i.e. non-host-network) Pods. In my cluster, all normal Pods happen to have an eth0 interface and the nodes don't, so this selects only the time series corresponding to normal Pods.

This solution is certainly not generally applicable, since network interfaces may be named differently in different clusters, or there may be multiple network interfaces per Pod. So, a label indicating whether a Pod is in the host network would already make things easier.

However, this still wouldn't meet the "sum must make sense" principle mentioned by @brancz. So, omitting the host network Pods from the metric entirely and having only a single time series for the entire network interface of a node might make sense.

vinayan3 commented
Pods with hostNetwork: true on hosts with many network interfaces lead to a huge bloat of metrics. In my case, a node runs a couple of DaemonSets with host networking on, and the node itself has multiple interfaces. Each Pod of those DaemonSets then exposes one series per host interface for every network metric, which leads to high cardinality in very large clusters with many nodes.

A solution is to sum up all the network interfaces by Pod, but I'm unsure whether that is a correct way to handle a Pod with hostNetwork: true.
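The per-Pod aggregation mentioned above could look like this (a sketch; it collapses the interface dimension and so reduces cardinality, but node traffic is still counted once per host-network Pod):

```promql
# One series per Pod instead of one per (Pod, interface).
sum by (namespace, pod) (rate(container_network_receive_bytes_total[5m]))
```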

JamesRaynor67 added a commit to JamesRaynor67/cadvisor that referenced this issue May 29, 2023
google#2615 mentioned a WAI (working-as-intended) cAdvisor behavior for pods with `hostNetwork: true`.

It should be mentioned in the docs to avoid people spending days testing and searching before finally finding it in the issue list.