Wait for waitInitExit() to return #1249
Conversation
@anmaxvl @helsaawy Assigning us three for this; small change, but in a critical path. I'm going to run the cri-containerd tests with this and see if I can spot anything that would trip it up. Gabriel has run the full integration test suite in upstream containerd (the integration/client tests), and he's run the TestRestartMonitor test in a loop for a couple of hours with no failures, iirc.
Ok, running this locally has the test fail for me, albeit for a different reason than before. Perhaps I'm missing a containerd-side change that's needed, or there's a difference in OS behavior on 20H2. @gabriel-samfira You tested this on ws2019, I assume?
Yes, there is more than this issue making the restart monitor test fail. The other issue is that in some situations sending a second kill fails. I will open a separate issue detailing the error returned.
I wonder if this helps. What do you think?
Force-pushed from e31f95a to 511321a
    @@ -588,6 +589,27 @@ func (ht *hcsTask) DeleteExec(ctx context.Context, eid string) (int, uint32, time.Time, error) {
        case shimExecStateRunning:
            return 0, 0, time.Time{}, newExecInvalidStateError(ht.id, eid, state, "delete")
        }

        if eid == "" {
If we are cleaning up resources here when deleting the init task (since this triggers `ht.close()`), should we also delete everything in `ht.execs` here?
Good question. Let me dig a bit deeper and check whether the entire `hcsTask` object is reaped somewhere after init dies and the init task gets deleted.
I updated the PR to remove the task object after successfully deleting the init exec:
Force-pushed from 0fb1e80 to 57417fb
Force-pushed from b4dd6c7 to 6bf7520
    // We've successfully deleted the init exec. Compute resources
    // should be closed, layers umounted and resources released.
    // This task should now be done.
    s.taskOrPod.Store(getEmptyPodOrTask(s.isSandbox))
If the service is tracking a pod (`s.isSandbox == true`), then there could be multiple tasks stored under `s.getPod().workloadTasks`, and emptying the `taskOrPod` store could prematurely close out other running tasks in the pod. (I have a draft PR to address this: #1234)
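To make the concern concrete, here is a minimal sketch with stand-in types (the real hcsshim `service` and pod types differ): in single-task mode the store holds only the task being deleted, while in pod mode it owns every workload task, so only the finished task should be removed.

```go
package shimsketch

import "sync"

// Stand-in types only; the real hcsshim service and pod types differ.
type task struct{ id string }

type pod struct {
	workloadTasks sync.Map // task ID -> *task
}

type service struct {
	isSandbox bool
	pod       *pod  // roughly what taskOrPod holds in pod (sandbox) mode
	task      *task // roughly what taskOrPod holds in single-task mode
}

// forgetTask illustrates the concern above: wiping the whole store is only
// safe in single-task mode; in pod mode it would also drop every other
// workload task that is still running.
func (s *service) forgetTask(tid string) {
	if !s.isSandbox {
		s.task = nil // the store held only the one task that was just deleted
		return
	}
	// Pod mode: remove just the finished workload task and keep the rest.
	s.pod.workloadTasks.Delete(tid)
}
```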
Ahh! Then I'll remove this bit and simply clean up the execs. Your branch will take care of this properly.
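For reference, "clean up the execs" could look roughly like the sketch below, assuming `ht.execs` is a `sync.Map` keyed by exec ID (a stand-in type, not the actual hcsshim task):

```go
package shimsketch

import "sync"

// hcsTaskSketch is a stand-in for the real hcsTask; ht.execs is assumed
// (per the review discussion) to be a sync.Map of exec ID -> exec.
type hcsTaskSketch struct {
	execs sync.Map
}

// cleanupExecs removes every exec from the map. Returning true from the
// callback is what keeps Range visiting every element.
func (ht *hcsTaskSketch) cleanupExecs() {
	ht.execs.Range(func(key, value interface{}) bool {
		ht.execs.Delete(key)
		return true
	})
}
```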
Done. Please have another look.
Force-pushed from d2cf3e8 to 2d07914
lgtm
@gabriel-samfira Sorry for the delay on check-in; we wanted to have one of us run some tests on this to verify that the (now correct, funnily enough 😆) behavior doesn't break any assumptions we may have built around the old behavior. The plan is to check this in today and get it backported to release/0.9 if everything's smooth. I'll also run the RestartMonitor test to see if it works for me now.
So TestRestartMonitor still fails for me, but with a different error. Unless there was an extra fix needed in the integration tests that I'm forgetting, @gabriel-samfira?
This looks like a filesystem permissions issue. Try the following:

    icacls.exe "C:\Users\danny\AppData\Local\Temp" /grant "CREATOR OWNER":(OI)(CI)(IO)F /T

This will give "CREATOR OWNER" full access rights on files they create, similar to what we had to do in the CI, here: https://github.com/containerd/containerd/blob/main/script/test/utils.sh#L157-L162
Ok, this has been running in a loop on two machines, one ws2019 and one ws2022, with no issues for over an hour. I just ran the cri-containerd tests we have in /test in this repo, and no issues were observed there either, so things look A-OK!
Force-pushed from 2d07914 to 6a00fc5
The function passed into the Range function of sync.Map will stop the iteration if false is returned. This commit makes sure we iterate through all elements in the map. Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
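For reference, a small self-contained example of the `sync.Map.Range` contract the commit message describes:

```go
package main

import (
	"fmt"
	"sync"
)

// Demonstrates the sync.Map.Range contract: iteration stops as soon as the
// callback returns false, so a callback that always returns true is needed
// to visit every element.
func main() {
	var m sync.Map
	for i := 0; i < 5; i++ {
		m.Store(i, i)
	}

	visited := 0
	m.Range(func(key, value interface{}) bool {
		visited++
		return false // stops after the first element
	})
	fmt.Println("returned false:", visited, "element visited")

	visited = 0
	m.Range(func(key, value interface{}) bool {
		visited++
		return true // keep iterating
	})
	fmt.Println("returned true:", visited, "elements visited")
}
```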
This change gives waitInitExit() a chance to clean up resources when DeleteExec() is called, before returning. This should fix situations where the shim exits before releasing container resources. Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
Force-pushed from 6a00fc5 to a6edb25
LGTM
Wait for waitInitExit() to return (cherry picked from commit 6241c53)
Wait for waitInitExit() to return (cherry picked from commit 6241c53) Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
[release/0.9] Wait for waitInitExit() to return #1249
This change gives waitInitExit() a chance to clean up resources when DeleteExec() is called, before returning.
TL;DR
This should fix situations where the shim exits before releasing container resources.
TS;NM
I am not sure if this is a proper fix, or if it is just a band-aid over something that should be fixed elsewhere, but it did fix the `TestRestartMonitor` test in `containerd`. This issue also seems to cause flakiness in the `TestContainerdRestart` test, because lingering sandboxes are picked up by the test and make it fail.

The restart monitor in `containerd` watches for situations where tasks end, and attempts to restart them. The `TestRestartMonitor` test will spawn a task, then send a `Kill()` signal to that task and wait for the restart monitor to respawn it.

The problem is that after the task is killed, it is properly marked as `terminated`, but the shim exits before `waitInitExit()` returns, so it doesn't have a chance to clean up container resources (unprepare layers, etc.). The test issues a `DeleteTask()`, which, because the task has been marked as `terminated`, does not fail any precondition and proceeds. Once the task is deleted, the shim exits (from my understanding), which, if the `waitInitExit()` goroutine has not finished cleaning up container resources, means that lingering layers will remain mounted and in use.

This gives the shim a chance to clean up resources before deleting the task.
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
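In outline, the fix makes the init-exec delete path wait on the cleanup goroutine. The sketch below shows that pattern with hypothetical types and a simplified signature; the actual hcsshim implementation differs in detail:

```go
package shimsketch

import "context"

// taskSketch stands in for the real hcsTask; cleanupDone and the method
// bodies are assumptions used to illustrate the synchronization, not the
// actual hcsshim implementation.
type taskSketch struct {
	cleanupDone chan struct{} // closed by waitInitExit once cleanup is finished
}

// waitInitExit waits for the init process to exit, releases container
// resources (unprepare layers, close the compute system, ...) and only then
// signals that cleanup is complete.
func (t *taskSketch) waitInitExit() {
	// ... wait for the init process to exit ...
	// ... release compute resources, unmount layers, etc. ...
	close(t.cleanupDone)
}

// DeleteExec blocks on the cleanup signal when the init exec is being
// deleted, so callers (and the shim's exit path) only proceed once the
// resources have actually been released.
func (t *taskSketch) DeleteExec(ctx context.Context, eid string) error {
	// ... existing state checks and exec deletion ...
	if eid == "" {
		select {
		case <-t.cleanupDone:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}
```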