runtime: remove kata_shim_netdev metric #9100

Merged
merged 2 commits into from Feb 26, 2024

littlejawa
Contributor

Following the discussion on #5738, I think the kata_shim_netdev metric is unneeded.
Not only does it use too much memory (causing Prometheus pods to be OOM-killed in some installations), it also reports data from interfaces that are unrelated to kata: there is no namespace isolation, so host-wide metrics get reported.

The shim's network usage does not give any useful insight anyway: we already have hypervisor and agent metrics for that.
So rather than trying to fix the memory/namespace question, I'm suggesting we just remove the shim netdev metrics.

I'm also doing it on the Rust implementation, even though it's not impacted by the initial bug as far as I can tell: in the runtime-rs case, the metrics are isolated (not verified, but that's my understanding at this point). Also, it seems we don't have hypervisor netdev metrics coming from runtime-rs, but we still have the agent's network metrics.
I feel it's worth removing anyway, if only for consistency, but I'm doing it in a separate commit in case reviewers ask to keep it as it is.

Member

gkurz commented Feb 21, 2024

/test

Member

@gkurz gkurz left a comment

LGTM. Thanks @littlejawa !

Member

gkurz commented Feb 22, 2024

go: downloading github.com/safchain/ethtool v0.2.0
verifying github.com/safchain/ethtool@v0.2.0: checksum mismatch
	downloaded: h1:dILxMBqDnQfX192cCAPjZr9v2IgVXeElHPy435Z/IdE=
	go.sum:     h1:tjsEsesUSlGdnUAAiIaEvk/YEycwk0k3Q6/q77qGpBI=

@littlejawa you need to rebase to get the fix from #9112.

As part of the shim network metrics, the shim is reporting network interfaces
from the host with no namespace isolation. This exposes interfaces that are
not tied to the kata containers, and causes an increase in resource usage for
kata metrics.

As the shim itself is not using the network (all its communication with
other processes is done with local unix sockets), there is no reason to
keep gathering and reporting shim-specific network metrics.
Actual network usage of the kata containers can be found from the existing
hypervisor network metrics (kata_hypervisor_netdev) and from the agent
network metrics (kata_guest_netdev_stat).

Fixes: kata-containers#5738

Signed-off-by: Julien Ropé <jrope@redhat.com>

For consistency with the Go runtime.
As the shim itself is not using the network (all its communication with
other processes is done with local unix sockets), there is no reason to
keep gathering and reporting shim-specific network metrics.
Actual network usage of the kata containers can be found from the existing
agent network metrics (kata_guest_netdev_stat).

Signed-off-by: Julien Ropé <jrope@redhat.com>

Member

gkurz commented Feb 22, 2024

/test

Member

@justxuewei justxuewei left a comment

Lgtm, thanks!

@justxuewei justxuewei merged commit bb5e33b into kata-containers:main Feb 26, 2024
289 of 296 checks passed
Labels
ok-to-test size/medium Average sized task
4 participants