The image of a running container was deleted by the image garbage collection #123631

Open
tomergayer opened this issue Mar 1, 2024 · 15 comments · May be fixed by #123711 or #124088
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@tomergayer

What happened?

The image of the container was deleted while the pod was running.
By the time I noticed it, almost two weeks had passed since the pod was created. From Prometheus metrics I could see that, at the time the pod was created, the kubelet performed image garbage collection on the node and images were deleted.

It might be worth mentioning that nothing other than the kubelet could have deleted the image from the node.

This happened a while ago and I couldn't investigate it for long at the time, so unfortunately I don't have data to share. But looking at the code, the following scenario seems to make sense:

  1. At the start of the garbage collection process, the image had already been on the node for a while, but was not yet in use.
  2. The garbage collection gets the current images and pods on that node from the runtime; because this image is not in use by any pod, it is considered unused. Ref:
    images, err := im.runtime.ListImages(ctx)
    if err != nil {
        return imagesInUse, err
    }
    pods, err := im.runtime.GetPods(ctx, true)
    if err != nil {
        return imagesInUse, err
    }
    // Make a set of images in use by containers.
    for _, pod := range pods {
        for _, container := range pod.Containers {
            klog.V(5).InfoS("Container uses image", "pod", klog.KRef(pod.Namespace, pod.Name), "containerName", container.Name, "containerImage", container.Image, "imageID", container.ImageID)
            imagesInUse.Insert(container.ImageID)
        }
    }
  3. Before the image is actually deleted, a pod using this image is scheduled on this node. Since the image is already on the node and the imagePullPolicy is IfNotPresent, there is no need to pull it, and the container starts running. Nothing here knows that the image is about to be deleted. Ref:
    imageRef, err := m.imageService.GetImageRef(ctx, spec)
    if err != nil {
        msg := fmt.Sprintf("Failed to inspect image %q: %v", container.Image, err)
        m.logIt(ref, v1.EventTypeWarning, events.FailedToInspectImage, logPrefix, msg, klog.Warning)
        return "", msg, ErrImageInspect
    }
    present := imageRef != ""
    if !shouldPullImage(container, present) {
        if present {
            msg := fmt.Sprintf("Container image %q already present on machine", container.Image)
            m.logIt(ref, v1.EventTypeNormal, events.PulledImage, logPrefix, msg, klog.Info)
            return imageRef, "", nil
        }
        msg := fmt.Sprintf("Container image %q is not present with pull policy of Never", container.Image)
        m.logIt(ref, v1.EventTypeWarning, events.ErrImageNeverPullPolicy, logPrefix, msg, klog.Warning)
        return "", msg, ErrImageNeverPull
    }
  4. The garbage collection is still running, and it eventually gets to deleting this image. Nothing tells it that the image is now in use.

In short, I think that in the window between the moment pods are read from the runtime to determine which images are unused and the moment those images are actually deleted (between step 2 and step 4), new pods can be scheduled on that node and use images that the garbage collection will delete immediately afterwards. Although this scenario is rare, the problematic window I described grows when the node holds many images and pods, and the difference between HighThresholdPercent and LowThresholdPercent has an effect as well.
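
To make the window concrete, here is a minimal sketch of the pattern; the names are illustrative only, not the actual kubelet code:

    package sketch

    import "context"

    // Illustrative stand-ins for the runtime interfaces the kubelet uses.
    type Image struct{ ID string }

    type Runtime interface {
        ImagesInUseByPods(ctx context.Context) map[string]bool // step 2: snapshot
        RemoveImage(ctx context.Context, id string) error      // step 4: deletion
    }

    // freeSpaceSketch shows the time-of-check-to-time-of-use window: the set of
    // images in use is computed once, but the node state can change before the
    // deletions below actually run.
    func freeSpaceSketch(ctx context.Context, rt Runtime, candidates []Image) {
        inUse := rt.ImagesInUseByPods(ctx) // step 2

        for _, img := range candidates {
            if inUse[img.ID] {
                continue
            }
            // Between the snapshot above and this call, a new pod can start a
            // container from img with imagePullPolicy: IfNotPresent (step 3);
            // the stale snapshot means the image is removed anyway (step 4).
            _ = rt.RemoveImage(ctx, img.ID)
        }
    }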

What did you expect to happen?

The image garbage collection should not delete used images.

How can we reproduce it (as minimally and precisely as possible)?

Suppose an unused image X has already been on the node for some time.
The timeline is as follows:
[timeline diagram]
Any pod that uses image X and is scheduled on the node within the orange zone of the timeline will cause this issue.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-15T00:36:28Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

local

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

CRI-O v1.27.3

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@tomergayer tomergayer added the kind/bug Categorizes issue or PR as related to a bug. label Mar 1, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 1, 2024
@tomergayer
Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 2, 2024
@Shubham1320

@tomergayer do you only see this issue with CRI-O?

AFAIK, if you are using Docker as your container runtime, Docker will block image deletion for any image that is being used by a running container (at least this is how the Docker CLI behaves; I'm not sure whether Docker's API behaves the same way).

So even if the kubelet GC triggers image deletion, Docker will eventually block the deletion if a running container is using the image.

Also, what effects did you see on the container that was using this image once the underlying image was deleted? Did the container exit?

@Shubham1320

Shubham1320 commented Mar 2, 2024

@tomergayer this is how containerd prevents the image from getting deleted if the image is referenced by any running container

containerd code ref

@SergeyKanzhelev SergeyKanzhelev added this to Triage in SIG Node Bugs Mar 4, 2024
@kannon92
Contributor

kannon92 commented Mar 4, 2024

@saschagrunert @haircommander I know that we had some cri-o garbage collection bugs reported recently. Any ideas on this one?

@kannon92
Contributor

kannon92 commented Mar 4, 2024

The issue I am thinking of is cri-o/cri-o#7143, but I'm not sure if it is related.

@saschagrunert
Member

saschagrunert commented Mar 5, 2024

@kannon92 no, that's not related. The issue here is about a race between image GC and container creation.

@saschagrunert
Member

Hm, CRI-O also complains when an image is in use (as expected): https://github.com/cri-o/cri-o/blob/0f7786ab6b671828dc57d4c16b42dff1f32ec3cf/vendor/github.com/containers/storage/store.go#L2560

If it's not in use by the runtime, then I can only imagine that the container creation RPC is still in progress but the container has already started, otherwise the creation RPC would fail.
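
One way to close that gap is to block image removal while a container creation is in progress, which is what the commit below does in the kubelet. A rough sketch of that kind of guard, with purely illustrative names rather than the actual PR code:

    package sketch

    import "sync"

    // gcGuard is an illustrative shared lock between container creation and
    // image GC: creations take the read lock, the GC takes the write lock for
    // the whole detect-and-delete pass, so neither sees the other half-done.
    type gcGuard struct {
        mu sync.RWMutex
    }

    // withCreation wraps the image lookup + container start path.
    func (g *gcGuard) withCreation(fn func() error) error {
        g.mu.RLock()
        defer g.mu.RUnlock()
        return fn()
    }

    // withGC wraps the image garbage collection pass.
    func (g *gcGuard) withGC(fn func() error) error {
        g.mu.Lock()
        defer g.mu.Unlock()
        return fn()
    }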

saschagrunert added a commit to saschagrunert/kubernetes that referenced this issue Mar 5, 2024
Avoid images being deleted which are still required because a container
creation is currently in progress. This fixes a rare race between the
image garbage collection and the kuberuntime manager.

Fixes kubernetes#123631

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
@kannon92
Contributor

kannon92 commented Mar 5, 2024

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 5, 2024
@kannon92 kannon92 moved this from Triage to Triaged in SIG Node Bugs Mar 5, 2024
saschagrunert added a commit to saschagrunert/cri-o that referenced this issue Mar 6, 2024
We now avoid removing images when a sandbox or container creation is in
progress. This should close the gap in the outlined corner case in:

kubernetes/kubernetes#123631

Refers to kubernetes/kubernetes#123711 as well.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
@tomergayer
Author

@Shubham1320 you are right that the runtime might block the remove request if the image is in use, but I don't think the kubelet should make any assumptions about that.

@tomergayer this is how containerd prevents the image from getting deleted if the image is referenced by any running container

containerd code ref

Hm, it looks like you're referring to the wrong place. Actually, I don't know how containerd behaves if you try to remove an image referenced by a running container; you should take a look here: https://github.com/containerd/containerd/blob/e53663cca75a3dd9c688a65a04f79350d6bb1fbd/internal/cri/server/images/image_remove.go#L36C31-L36C42
FYI: with crictl rmi you can send a remove request to the runtime similar to the one the kubelet sends.

@saschagrunert thanks for the PRs you opened, in the kubelet as well as in CRI-O. I have some thoughts about your fix in the kubelet. I wonder about the side effect of using a shared lock: it will prevent any container (already scheduled on that node) from starting while the GC is running, which might not be acceptable in some cases (and the delay it can introduce in the GC can also be a problem).
I can think of two other approaches to deal with it:

  1. The problem concerns images that the GC detects as unused, so it may be possible to lock only containers that use those images (the ones "marked for deletion") and in this way not affect any other container (see the sketch after this list).
  2. Prevent scheduling of new pods on the node while the GC is running. This could be implemented with something like an ImageGCRunning node condition.
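
For illustration, approach 1 could be built around a per-image lock table; a minimal sketch, with purely illustrative names (nothing here is existing kubelet code):

    package sketch

    import "sync"

    // imageLocks is a hypothetical per-image lock table: the GC would hold the
    // lock of every image it has marked for deletion, and container creation
    // would take only the lock of the image it is about to use, so containers
    // using other images are not blocked.
    type imageLocks struct {
        mu    sync.Mutex
        locks map[string]*sync.Mutex // keyed by image ID
    }

    func newImageLocks() *imageLocks {
        return &imageLocks{locks: map[string]*sync.Mutex{}}
    }

    // lockFor returns the lock for a given image, creating it on first use.
    func (l *imageLocks) lockFor(imageID string) *sync.Mutex {
        l.mu.Lock()
        defer l.mu.Unlock()
        if _, ok := l.locks[imageID]; !ok {
            l.locks[imageID] = &sync.Mutex{}
        }
        return l.locks[imageID]
    }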

The second approach makes more sense to me, as it avoids any locking or changes in kuberuntime, it makes it clear why pods are not being scheduled on that node, and it allows you to control this behavior if necessary.

Please let me know what you think about it.

@saschagrunert
Member

  1. The problem concerns images that the GC detects as unused, so it may be possible to lock only containers that use those images (the ones "marked for deletion") and in this way not affect any other container.

Yeah, locking by container image would be an option for the fix.

  2. Prevent scheduling of new pods on the node while the GC is running. This could be implemented with something like an ImageGCRunning node condition.

The second approach makes more sense to me, as it avoids any locking or changes in kuberuntime, it makes it clear why pods are not being scheduled on that node, and it allows you to control this behavior if necessary.

Please let me know what you think about it.

I like that idea, but that would mean we have to integrate it into a feature like PodDisruptionConditions, right?

@tomergayer
Author

tomergayer commented Mar 9, 2024

Hm, unlike PodDisruptionConditions this is for a bug fix, but because it impacts the node status and scheduling behavior, I agree we should integrate it into a feature. (Do you think it's worth opening an enhancement?)
I'm starting to work on a PR to make the idea clearer.

@saschagrunert
Member

Do you think it's worth opening an enhancement?

I would prefer having it part of an existing one.

@tomergayer
Author

I have some thoughts about the implementation of the idea we discussed. Currently, how often the kubelet checks for changes in node conditions depends on nodeStatusUpdateFrequency (default: 10 seconds).
To prevent the race between image GC and container creation with something like an ImageGCRunning node condition, the condition must be updated when the GC is triggered (disk usage over the threshold) and before the GC gets the current pods from the runtime, so we can't wait for nodeStatusUpdateFrequency to update it.
One option would be to call syncNodeStatus whenever this condition needs to be updated (i.e. when image GC is triggered), but that means checking and updating the entire node status, which is a bit of overhead for this case.
So we could add a function for a fast node condition sync, which only updates a specific node condition, but that would require a bigger change in the kubelet.
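
For illustration, such a fast sync could patch just that one condition on the node status instead of rebuilding the whole status. A minimal sketch using client-go, where the function name and wiring are hypothetical rather than existing kubelet code:

    package sketch

    import (
        "context"
        "encoding/json"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // fastSetNodeCondition is a hypothetical helper that patches a single node
    // condition (e.g. an ImageGCRunning condition) instead of syncing the whole
    // node status. Node conditions are merged by type, so a strategic merge
    // patch containing one condition only touches that condition.
    func fastSetNodeCondition(ctx context.Context, client kubernetes.Interface, nodeName string, cond v1.NodeCondition) error {
        cond.LastHeartbeatTime = metav1.Now()
        patch, err := json.Marshal(map[string]interface{}{
            "status": map[string]interface{}{
                "conditions": []v1.NodeCondition{cond},
            },
        })
        if err != nil {
            return err
        }
        // PatchStatus applies a strategic merge patch against the node's status.
        _, err = client.CoreV1().Nodes().PatchStatus(ctx, nodeName, patch)
        return err
    }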

@saschagrunert I'd love to hear what you think about it, and whether you think it's worth adding this function, and the changes it would require, to the kubelet for the benefit of this idea.

@saschagrunert
Member

which only updates a specific node condition, but that would require a bigger change in the kubelet.

I think this would be a good way forward. It may have other implications performance-wise, but do you think you could propose that as a draft PR?

@tomergayer tomergayer linked a pull request Mar 27, 2024 that will close this issue
@tomergayer
Author

but do you think you could propose that as a draft PR?

Yes, I just created one
