The image of a running container was deleted by the image garbage collection #123631

Open
tomergayer opened this issue Mar 1, 2024 · 15 comments · May be fixed by #123711 or #124088
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@tomergayer

What happened?

The image of the container was deleted while the pod was running.
By the time I noticed it, almost two weeks had passed since the pod was created. From Prometheus metrics I could see that, at the time the pod was created, the kubelet performed image garbage collection on the node and images were deleted.

It might be worth mentioning that nothing other than the kubelet could have deleted the image from the node.

This happened a while ago and I couldn't investigate it for long at the time, so unfortunately I don't have data to share. But looking at the code, the following scenario seems to make sense:

  1. At the start of the garbage collection process, the image had already been on the node for a while, but was not yet in use.
  2. The garbage collection gets the current images and pods on that node from the runtime; because this image is not in use by any pod, it is considered unused. Ref:
    images, err := im.runtime.ListImages(ctx)
    if err != nil {
        return imagesInUse, err
    }
    pods, err := im.runtime.GetPods(ctx, true)
    if err != nil {
        return imagesInUse, err
    }
    // Make a set of images in use by containers.
    for _, pod := range pods {
        for _, container := range pod.Containers {
            klog.V(5).InfoS("Container uses image", "pod", klog.KRef(pod.Namespace, pod.Name), "containerName", container.Name, "containerImage", container.Image, "imageID", container.ImageID)
            imagesInUse.Insert(container.ImageID)
        }
    }
  3. Before the image is actually deleted, a pod using this image is scheduled on this node. Since the image is already on the node and the imagePullPolicy is IfNotPresent, there is no need to pull it, and the container starts running. Nothing here knows that the image is about to be deleted. Ref:
    imageRef, err := m.imageService.GetImageRef(ctx, spec)
    if err != nil {
        msg := fmt.Sprintf("Failed to inspect image %q: %v", container.Image, err)
        m.logIt(ref, v1.EventTypeWarning, events.FailedToInspectImage, logPrefix, msg, klog.Warning)
        return "", msg, ErrImageInspect
    }
    present := imageRef != ""
    if !shouldPullImage(container, present) {
        if present {
            msg := fmt.Sprintf("Container image %q already present on machine", container.Image)
            m.logIt(ref, v1.EventTypeNormal, events.PulledImage, logPrefix, msg, klog.Info)
            return imageRef, "", nil
        }
        msg := fmt.Sprintf("Container image %q is not present with pull policy of Never", container.Image)
        m.logIt(ref, v1.EventTypeWarning, events.ErrImageNeverPullPolicy, logPrefix, msg, klog.Warning)
        return "", msg, ErrImageNeverPull
    }
  4. The garbage collection is still running, and it eventually gets to deleting this image. Nothing tells it that the image is now in use.

In short, I think that in the window between the moment pods are read from the runtime to determine which images are unused and the moment those images are actually deleted (between step 2 and step 4), new pods can be scheduled on that node and use images that the garbage collection will delete immediately afterwards. Although this scenario is rare, the problematic window I described grows when the node holds many images and pods, and the difference between HighThresholdPercent and LowThresholdPercent has an effect as well.
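
To make the window concrete, here is a minimal sketch of the pattern; the names are illustrative only, not the actual kubelet code:

    package sketch

    import "context"

    // Illustrative stand-ins for the runtime interfaces the kubelet uses.
    type Image struct{ ID string }

    type Runtime interface {
        ImagesInUseByPods(ctx context.Context) map[string]bool // step 2: snapshot
        RemoveImage(ctx context.Context, id string) error      // step 4: deletion
    }

    // freeSpaceSketch shows the time-of-check-to-time-of-use window: the set of
    // images in use is computed once, but the node state can change before the
    // deletions below actually run.
    func freeSpaceSketch(ctx context.Context, rt Runtime, candidates []Image) {
        inUse := rt.ImagesInUseByPods(ctx) // step 2

        for _, img := range candidates {
            if inUse[img.ID] {
                continue
            }
            // Between the snapshot above and this call, a new pod can start a
            // container from img with imagePullPolicy: IfNotPresent (step 3);
            // the stale snapshot means the image is removed anyway (step 4).
            _ = rt.RemoveImage(ctx, img.ID)
        }
    }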

What did you expect to happen?

The image garbage collection should not delete used images.

How can we reproduce it (as minimally and precisely as possible)?

Suppose an unused image X has already been on the node for some time.
The timeline is as follows:
[timeline diagram]
Any pod that uses image X and is scheduled on the node within the orange zone of the timeline will cause this issue.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-15T00:36:28Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

local

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

CRI-O v1.27.3

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@tomergayer tomergayer added the kind/bug Categorizes issue or PR as related to a bug. label Mar 1, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 1, 2024
@tomergayer
Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 2, 2024
@Shubham1320

@tomergayer do you only see this issue with CRI-O?

AFAIK, if you are using Docker as your container runtime, Docker will block image deletion for any image that is being used by a running container (at least this is how the Docker CLI behaves; I'm not sure whether Docker's API behaves the same way).

So even if the kubelet GC triggers image deletion, Docker will eventually block the deletion if a running container is using the image.

Also, what effects did you see on the container that was using this image once the underlying image was deleted? Did the container exit?

@Shubham1320

Shubham1320 commented Mar 2, 2024

@tomergayer this is how containerd prevents the image from getting deleted if the image is referenced by any running container

containerd code ref

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Mar 4, 2024
@kannon92
Contributor

kannon92 commented Mar 4, 2024

@saschagrunert @haircommander I know that we had some cri-o garbage collection bugs reported recently. Any ideas on this one?

@kannon92
Contributor

kannon92 commented Mar 4, 2024

The issue I am thinking of is cri-o/cri-o#7143, but I'm not sure if it is related.

@saschagrunert
Member

saschagrunert commented Mar 5, 2024

@kannon92 no, that's not related. The issue here is about a race between image GC and container creation.

@saschagrunert
Member

Hm, CRI-O also complains when an image is in use (as expected): https://github.com/cri-o/cri-o/blob/0f7786ab6b671828dc57d4c16b42dff1f32ec3cf/vendor/github.com/containers/storage/store.go#L2560

If it's not in use by the runtime, then I can only imagine that the container creation RPC is still in progress but the container has already started, otherwise the creation RPC would fail.
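
One way to close that gap is to block image removal while a container creation is in progress, which is what the commit below does in the kubelet. A rough sketch of that kind of guard, with purely illustrative names rather than the actual PR code:

    package sketch

    import "sync"

    // gcGuard is an illustrative shared lock between container creation and
    // image GC: creations take the read lock, the GC takes the write lock for
    // the whole detect-and-delete pass, so neither sees the other half-done.
    type gcGuard struct {
        mu sync.RWMutex
    }

    // withCreation wraps the image lookup + container start path.
    func (g *gcGuard) withCreation(fn func() error) error {
        g.mu.RLock()
        defer g.mu.RUnlock()
        return fn()
    }

    // withGC wraps the image garbage collection pass.
    func (g *gcGuard) withGC(fn func() error) error {
        g.mu.Lock()
        defer g.mu.Unlock()
        return fn()
    }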

saschagrunert added a commit to saschagrunert/kubernetes that referenced this issue Mar 5, 2024
Avoid images being deleted which are still required because a container
creation is currently in progress. This fixes a rare race between the
image garbage collection and the kuberuntime manager.

Fixes kubernetes#123631

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
@kannon92
Contributor

kannon92 commented Mar 5, 2024

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 5, 2024
@kannon92 kannon92 moved this from Triage to Triaged in SIG Node Bugs Mar 5, 2024
saschagrunert added a commit to saschagrunert/cri-o that referenced this issue Mar 6, 2024
We now avoid removing images when a sandbox or container creation is in
progress. This should close the gap in the outlined corner case in:

kubernetes/kubernetes#123631

Refers to kubernetes/kubernetes#123711 as well.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
@tomergayer
Author

@Shubham1320 you are right that the runtime might block the remove request if the image is in use, but I don't think the kubelet should make any assumptions about that.

@tomergayer this is how containerd prevents the image from getting deleted if the image is referenced by any running container

containerd code ref

Hm, it looks like you're referring to the wrong place. Actually, I don't know how containerd behaves if you try to remove an image referenced by a running container; you should take a look here: https://github.com/containerd/containerd/blob/e53663cca75a3dd9c688a65a04f79350d6bb1fbd/internal/cri/server/images/image_remove.go#L36C31-L36C42
FYI: with crictl rmi you can send a remove request to the runtime similar to the one the kubelet sends.

@saschagrunert thanks for the PRs you opened, in the kubelet as well as in CRI-O. I have some thoughts about your fix in the kubelet. I wonder about the side effect of using a shared lock: it will prevent any container (already scheduled on that node) from starting while the GC is running, which might not be acceptable in some cases (and the delay it can introduce in the GC can also be a problem).
I can think of two other approaches to deal with it:

  1. The problem concerns images that the GC detects as unused, so it may be possible to lock only containers that use those images (the ones "marked for deletion") and in this way not affect any other container (see the sketch after this list).
  2. Prevent scheduling of new pods on the node while the GC is running. This could be implemented with something like an ImageGCRunning node condition.
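
For illustration, approach 1 could be built around a per-image lock table; a minimal sketch, with purely illustrative names (nothing here is existing kubelet code):

    package sketch

    import "sync"

    // imageLocks is a hypothetical per-image lock table: the GC would hold the
    // lock of every image it has marked for deletion, and container creation
    // would take only the lock of the image it is about to use, so containers
    // using other images are not blocked.
    type imageLocks struct {
        mu    sync.Mutex
        locks map[string]*sync.Mutex // keyed by image ID
    }

    func newImageLocks() *imageLocks {
        return &imageLocks{locks: map[string]*sync.Mutex{}}
    }

    // lockFor returns the lock for a given image, creating it on first use.
    func (l *imageLocks) lockFor(imageID string) *sync.Mutex {
        l.mu.Lock()
        defer l.mu.Unlock()
        if _, ok := l.locks[imageID]; !ok {
            l.locks[imageID] = &sync.Mutex{}
        }
        return l.locks[imageID]
    }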

The second approach makes more sense to me, as it avoids any locking or changes in kuberuntime, it makes it clear why pods are not being scheduled on that node, and it allows you to control this behavior if necessary.

Please let me know what you think about it.

@saschagrunert
Member

  1. The problem concerns images that the GC detects as unused, so it may be possible to lock only containers that use those images (the ones "marked for deletion") and in this way not affect any other container.

Yeah, locking by container image would be an option for the fix.

  2. Prevent scheduling of new pods on the node while the GC is running. This could be implemented with something like an ImageGCRunning node condition.

The second approach makes more sense to me, as it avoids any locking or changes in kuberuntime, it makes it clear why pods are not being scheduled on that node, and it allows you to control this behavior if necessary.

Please let me know what you think about it.

I like that idea, but that would mean we have to integrate it into a feature like PodDisruptionConditions, right?

@tomergayer
Author

tomergayer commented Mar 9, 2024

Hm, unlike PodDisruptionConditions this is for a bug fix, but because it impacts the node status and scheduling behavior, I agree we should integrate it into a feature. (Do you think it's worth opening an enhancement?)
I'm starting to work on a PR to make the idea clearer.

@saschagrunert
Member

Do you think it's worth opening an enhancement?

I would prefer having it part of an existing one.

@tomergayer
Author

I have some thoughts about the implementation of the idea we discussed. Currently, how often the kubelet checks for changes in node conditions depends on nodeStatusUpdateFrequency (default: 10 seconds).
To prevent the race between image GC and container creation with something like an ImageGCRunning node condition, the condition must be updated when the GC is triggered (disk usage over the threshold) and before the GC gets the current pods from the runtime, so we can't wait for nodeStatusUpdateFrequency to update it.
One option would be to call syncNodeStatus whenever this condition needs to be updated (i.e. when image GC is triggered), but that means checking and updating the entire node status, which is a bit of overhead for this case.
So we could add a function for a fast node condition sync, which only updates a specific node condition, but that would require a bigger change in the kubelet.
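
For illustration, such a fast sync could patch just that one condition on the node status instead of rebuilding the whole status. A minimal sketch using client-go, where the function name and wiring are hypothetical rather than existing kubelet code:

    package sketch

    import (
        "context"
        "encoding/json"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // fastSetNodeCondition is a hypothetical helper that patches a single node
    // condition (e.g. an ImageGCRunning condition) instead of syncing the whole
    // node status. Node conditions are merged by type, so a strategic merge
    // patch containing one condition only touches that condition.
    func fastSetNodeCondition(ctx context.Context, client kubernetes.Interface, nodeName string, cond v1.NodeCondition) error {
        cond.LastHeartbeatTime = metav1.Now()
        patch, err := json.Marshal(map[string]interface{}{
            "status": map[string]interface{}{
                "conditions": []v1.NodeCondition{cond},
            },
        })
        if err != nil {
            return err
        }
        // PatchStatus applies a strategic merge patch against the node's status.
        _, err = client.CoreV1().Nodes().PatchStatus(ctx, nodeName, patch)
        return err
    }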

@saschagrunert I'd love to hear what you think about it, and whether you think it's worth adding this function, and the changes it would require, to the kubelet for the benefit of this idea.

@saschagrunert
Member

which only updates a specific node condition, but that would require a bigger change in the kubelet.

I think this would be a good way forward. It may have other implications performance-wise, but do you think you could propose that as a draft PR?

@tomergayer tomergayer linked a pull request Mar 27, 2024 that will close this issue
@tomergayer
Author

but do you think you could propose that as a draft PR?

Yes, I just created one
