Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stable-2.3 | kata-monitor: increase delay before syncing with the container manager #3554

Merged
merged 1 commit into from
Jan 27, 2022
Merged

stable-2.3 | kata-monitor: increase delay before syncing with the container manager #3554

merged 1 commit into from
Jan 27, 2022

Conversation

fgiudici
Copy link
Contributor

This is meant to address #3550 on the stable branch.

Proposal

We are stable, so we probably don't want to backport #3553 here.
Since we use the sync with the container manager basically to just remove pods from the cache, we can wait some more time before syncing (and so reduce the chance to miss a kata pod just because it was not ready yet) without missing reporting any metric. The sync will just clean-up the sandbox cache from stale entries.

So, let's just raise the waiting time before starting the sync timer to an higher value: this will not fix the issue (if a pod has not completely started when the sync timer fires, we will miss its metrics till a new kata pod starts / is removed) but will make the issue to be much more unlikely to happen.

When we detect a new kata sandbox from the sbs fs, we add that to the
sandbox cache to retrieve metrics.
We also schedule a sync with the container manager, which we consider
the source of truth: if the kata pod is not yet ready the container
manager will not report it and we will drop it from our cache.
We will add it back only when we re-sync, i.e., when we get an event
from the sbs fs (which means a kata pod has been terminated or a new one
has been started).

Since we use the sync with the container manager to remove pods from the
cache, we can wait some more before syncing (and so reduce the chance to
miss a kata pod just because it was not ready yet).

Let's raise the waiting time before starting the sync timer.

Fixes: #3550

Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
@fgiudici fgiudici changed the title kata-monitor: increase delay before syncing with the container manager stable-2.3 | kata-monitor: increase delay before syncing with the container manager Jan 25, 2022
@fgiudici
Copy link
Contributor Author

/test

Copy link
Member

@c3d c3d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Wonder if we could make that a configurable option in a later iteration?

@fidencio
Copy link
Member

A head's up that may affect your PR.

In the Architecture Committee meeting from January 25th, 2022, the Architecture Committee has agreed on using the "Dismiss stale pull request approvals when new commits are pushed" configuration from GitHub. It basically means that if your PR has been rebased or updated, the approvals given will be erased.

In order to minimize traumas due to the new approach, please, consider adding a note on the changes done before the rebase / force-push, and also pinging the reviewers for a subsequent round of reviews.

Thanks for your understanding!

Related issue: kata-containers/community#249

@fgiudici
Copy link
Contributor Author

Thanks for reviewing!

Wonder if we could make that a configurable option in a later iteration?

Yeah, I had that thought too.
I think that the proper fix is to stick with the sbs fs to populate the sandbox cache. In this way we could even completely drop the sync with the container manager: looked a bit too intrusive for a stable branch ;-) (in #3553 we just use the container manager to get the kubernetes metadata to attach to the metrics).
Here my idea is to just do the bare minimum, i.e., put a delay high enough to be confident that the pod will be running once we check the container manager (if no issues on startup happen).
I would just leave this static and move to the #3553 solution (adding a new parameter here shouldn't be that hard anyway).

Copy link
Member

@fidencio fidencio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm as a workaround and I do agree it's not worth pulling the full fix down to the stable branch.
Thanks, @fgiudici!

@fgiudici
Copy link
Contributor Author

/retest-s390x

@bergwolf bergwolf merged commit 04426d6 into kata-containers:stable-2.3 Jan 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants