fix crio deadlock in getting crio sandbox containers #3838
dims merged 3 commits into google:master
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>
2369483 to b4df00e
@haircommander can you please review?
	return false, false, nil
}

// When using systemd as the cgroup driver, sandbox containers don't have
this isn't the right way to check. users can still use cgroupfs even if systemd is present. luckily, cri-o has for a while exposed the CgroupDriver in its info endpoint, which cadvisor is not currently using. can you update this to ask cri-o what cgroup driver it's using, and branch on that?
oh yeah that's much nicer
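A minimal sketch of the suggestion above: read the cgroup driver from cri-o's info response and branch on it, rather than inferring it from whether systemd is present on the host. The struct, both function names, and the `cgroup_driver` JSON key are assumptions made for illustration; they are not taken from this PR or from cadvisor's cri-o client.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// crioInfo holds only the field this sketch needs from cri-o's info response.
// The JSON key "cgroup_driver" is an assumption, not copied from the PR.
type crioInfo struct {
	CgroupDriver string `json:"cgroup_driver"`
}

// cgroupDriver extracts the driver name from a raw info response body.
// (Hypothetical helper; the real client would fetch the body from cri-o.)
func cgroupDriver(body []byte) (string, error) {
	var info crioInfo
	if err := json.Unmarshal(body, &info); err != nil {
		return "", fmt.Errorf("decoding cri-o info: %w", err)
	}
	if info.CgroupDriver == "" {
		return "", fmt.Errorf("cri-o info did not report a cgroup driver")
	}
	return info.CgroupDriver, nil
}

// usesSystemdCgroups reports whether cri-o itself says it uses the systemd
// driver. A host can run systemd while cri-o is configured for cgroupfs,
// so this asks the runtime instead of probing the host.
func usesSystemdCgroups(driver string) bool {
	return driver == "systemd"
}

func main() {
	// Sample payload invented for the example.
	body := []byte(`{"storage_driver":"overlay","cgroup_driver":"systemd"}`)
	driver, err := cgroupDriver(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("driver:", driver, "systemd layout:", usesSystemdCgroups(driver))
}
```

Returning an error when the field is empty keeps the caller from silently assuming one layout when cri-o did not report a driver.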
thanks for picking this up @olyazavr !
Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>
LGTM
@haircommander is there anything else I need to do here? Merging is blocked for me
cc @dims 👀
In cri-o/cri-o#8748, I found that cadvisor v0.48.1 had a bug that caused crio to essentially deadlock if there was a long-terminating container and a kubelet restart. This was fixed in cadvisor v0.49.0, but the bug is back in v0.52.1.
I found that #3457 was the fix, which was later reverted in #3565.
1.27 has cadvisor v0.47.2 (working)
1.29 has cadvisor v0.48.1 (broken)
1.30 has cadvisor v0.49.0 (working)
1.31 has cadvisor v0.49.0 (working)
1.33 has cadvisor v0.52.1 (broken)
What happens here is that cadvisor finds sandbox containers in cgroups (they exist as cgroup directories) and calls cri-o, which returns 404 because sandbox containers aren't returned by the inspect endpoint. I suspect something then ends up holding a mutex that prevents anything else from going through, which causes kubelet to fail to start because crio/cadvisor are stuck.
This re-does the original PR (#3457) but addresses the systemd/cgroupfs issue