stable-2.3 | kata-monitor: increase delay before syncing with the container manager #3554

fgiudici · 2022-01-25T17:32:25Z

This is meant to address #3550 on the stable branch.

Proposal

We are stable, so we probably don't want to backport #3553 here.
Since we use the sync with the container manager basically to just remove pods from the cache, we can wait some more time before syncing (and so reduce the chance to miss a kata pod just because it was not ready yet) without missing reporting any metric. The sync will just clean-up the sandbox cache from stale entries.

So, let's just raise the waiting time before starting the sync timer to an higher value: this will not fix the issue (if a pod has not completely started when the sync timer fires, we will miss its metrics till a new kata pod starts / is removed) but will make the issue to be much more unlikely to happen.

When we detect a new kata sandbox from the sbs fs, we add that to the sandbox cache to retrieve metrics. We also schedule a sync with the container manager, which we consider the source of truth: if the kata pod is not yet ready the container manager will not report it and we will drop it from our cache. We will add it back only when we re-sync, i.e., when we get an event from the sbs fs (which means a kata pod has been terminated or a new one has been started). Since we use the sync with the container manager to remove pods from the cache, we can wait some more before syncing (and so reduce the chance to miss a kata pod just because it was not ready yet). Let's raise the waiting time before starting the sync timer. Fixes: #3550 Signed-off-by: Francesco Giudici <fgiudici@redhat.com>

fgiudici · 2022-01-25T17:37:12Z

/test

c3d

Looks good to me.

Wonder if we could make that a configurable option in a later iteration?

fidencio · 2022-01-26T12:36:14Z

A head's up that may affect your PR.

In the Architecture Committee meeting from January 25th, 2022, the Architecture Committee has agreed on using the "Dismiss stale pull request approvals when new commits are pushed" configuration from GitHub. It basically means that if your PR has been rebased or updated, the approvals given will be erased.

In order to minimize traumas due to the new approach, please, consider adding a note on the changes done before the rebase / force-push, and also pinging the reviewers for a subsequent round of reviews.

Thanks for your understanding!

Related issue: kata-containers/community#249

fgiudici · 2022-01-26T13:05:24Z

Thanks for reviewing!

Wonder if we could make that a configurable option in a later iteration?

Yeah, I had that thought too.
I think that the proper fix is to stick with the sbs fs to populate the sandbox cache. In this way we could even completely drop the sync with the container manager: looked a bit too intrusive for a stable branch ;-) (in #3553 we just use the container manager to get the kubernetes metadata to attach to the metrics).
Here my idea is to just do the bare minimum, i.e., put a delay high enough to be confident that the pod will be running once we check the container manager (if no issues on startup happen).
I would just leave this static and move to the #3553 solution (adding a new parameter here shouldn't be that hard anyway).

fidencio

lgtm as a workaround and I do agree it's not worth pulling the full fix down to the stable branch.
Thanks, @fgiudici!

fgiudici · 2022-01-26T17:00:28Z

/retest-s390x

fgiudici changed the title ~~kata-monitor: increase delay before syncing with the container manager~~ stable-2.3 | kata-monitor: increase delay before syncing with the container manager Jan 25, 2022

c3d approved these changes Jan 26, 2022

View reviewed changes

fidencio approved these changes Jan 26, 2022

View reviewed changes

bergwolf merged commit 04426d6 into kata-containers:stable-2.3 Jan 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stable-2.3 | kata-monitor: increase delay before syncing with the container manager #3554

stable-2.3 | kata-monitor: increase delay before syncing with the container manager #3554

fgiudici commented Jan 25, 2022

fgiudici commented Jan 25, 2022

c3d left a comment

fidencio commented Jan 26, 2022

fgiudici commented Jan 26, 2022

fidencio left a comment

fgiudici commented Jan 26, 2022

stable-2.3 | kata-monitor: increase delay before syncing with the container manager #3554

stable-2.3 | kata-monitor: increase delay before syncing with the container manager #3554

Conversation

fgiudici commented Jan 25, 2022

Proposal

fgiudici commented Jan 25, 2022

c3d left a comment

Choose a reason for hiding this comment

fidencio commented Jan 26, 2022

fgiudici commented Jan 26, 2022

fidencio left a comment

Choose a reason for hiding this comment

fgiudici commented Jan 26, 2022