Provide a state liblxc function that has a timeout
#4257
It's pretty strange that we are facing this issue. This is the lxcapi handler for
Then,
So, now let's look at our liblxc
It means that if the All stuff around |
@mihalicyn I've been investigating something which may (or may not) be related over in this PR: https://github.com/lxc/lxd/pull/11278

In our LXD cluster evacuation tests we create some containers and then initiate a clean shutdown with a timeout of 5s (passed to liblxc via go-lxc). If the container doesn't respond or shut down in time (which is normal, as we aren't waiting for it to fully start up, so it might not catch the signal), then LXD performs a forceful stop using go-lxc. It seems that sometimes the forceful stop call either never returns or returns after a very long time (>30s).

The failure scenario we are seeing looks like this: Because LXD gives up waiting for liblxc to return (and likely leaks a goroutine because of it).

In PR https://github.com/lxc/lxd/pull/11278 if I perform a forceful stop first, and don't attempt a clean shutdown first, then it completes fine: So it looks like something is happening inside liblxc that hangs on forceful stop if a clean stop has timed out. |
Ah, that makes the picture clearer! Thanks, Tom! As far as I understand lxc/incus@dad047a I just want to say, that we should explore this more in detail then, before adding some timeout support to the |
No, not quite, the Because of the way LXD sets up LXC containers, we would expect liblxc to call our stop hooks (by way of an API call when the container has stopped). But I added some logging to LXD and I often didn't see that call arrive at LXD for >30s when the problems occurred. But this is intermittent. |
I added logging to check it was a LXD bug, and I could see it hanging on the call to go-lxc's |
I only see this in the Jenkins test runners, but if they are running a problematic kernel, then perhaps it's to do with the same problem. |
Do we have any evidence that |
The only place where we can wait too long is Fun fact is that from the container daemon side we have ultra-light handler for the
It just can't freeze :) => the problem is somewhere in the lower levels of the command processing daemon. Possibly the daemon is already executing some other command which is stuck somewhere. I'll try to reproduce such a situation locally with some hacks. This could also be related to io_uring, because the command processing daemon uses io_uring. |
@stgraber do you know what kernel the Jenkins PR test runners use? |
Ubuntu 5.15.0, usually 5.15.0-52 apparently |
Might be useful https://github.com/lxc/lxd/issues/11228#issuecomment-1382037545 Btw, Ubuntu-5.15.0-52.58 contains only 4 fixups for this |
What we can do from our side, is to add some timeout to @stgraber @brauner what do you think about adding such a timeout? |
Sounds good to me. In general the monitor commands are all supposed to be quick, if that's not the case, there's something wrong going on and it's probably best to not remain hung at that stage. |
Hi, original reporter here. I might have accidentally confused the issue by referring to "shutdown". In fact, I think the final result turned out to be that the "status" was blocking, and this in turn led to some other functions getting stuck and blocked, including shutdown.

So after more research, it seems like "lxc console", when using io_uring, will eventually (usually within only a few seconds) get into a state where it's "stuck" (sorry, vague term, I mean this from a user perspective, without reference to the actual code). Whatever has happened will then also cause the "lxc list" command to block and never return data to the end user. (Hence the confusion: when I originally experienced this, I was trying to shut down an instance which was misbehaving; however, the deeper issue was clearly something like the status function not returning.)

I accept that this is trying to solve for an undefined state that shouldn't ever appear! However, once it had happened to me, it was difficult to escape from, because I lacked information on which instance had "gone rogue", so I was rather in the dark trying to find which instance had stopped functioning.

I suspect that there isn't a single point to fix this, since it should never happen (basically io_uring calls not working as advertised), but fixing key higher-level functions to survive such a state seems helpful? Meaning, the ability for "shutdown" to force kill a misbehaving instance, and the "lxc list" function surviving and ignoring the stuck instance, would both be very helpful.

Note, I think all the above is known and you are in agreement with it. I just wanted to explain how the initial report evolved from "cannot shutdown" to understanding that it's an io_uring issue which is affecting internal features such as those used by "lxc list". |
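The idea of "lxc list" surviving and ignoring a stuck instance can be pictured as a per-instance state query that gives up after a deadline. What follows is only a rough C sketch under stated assumptions: it does not call liblxc or LXD at all, a socketpair stands in for the connection to each monitor, and fake_monitor, fake_monitor_open, query_state and count_stuck are hypothetical names made up for illustration.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for one instance's monitor connection. */
struct fake_monitor {
	int client_fd;  /* end the listing code reads from */
	int monitor_fd; /* end a real monitor would write to */
};

/* Opens a connected pair; a NULL reply simulates a monitor that never answers. */
static int fake_monitor_open(struct fake_monitor *m, const char *reply)
{
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	if (reply && send(sv[1], reply, strlen(reply), 0) < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	m->client_fd = sv[0];
	m->monitor_fd = sv[1];
	return 0;
}

/*
 * Reads a state reply into out. If nothing arrives within timeout_ms
 * (must be > 0, since a zero timeval means "block forever"), out is set
 * to "UNKNOWN". Returns 0 on a reply, -1 on timeout or error.
 */
static int query_state(int fd, int timeout_ms, char *out, size_t len)
{
	struct timeval tv = {
		.tv_sec = timeout_ms / 1000,
		.tv_usec = (timeout_ms % 1000) * 1000,
	};
	ssize_t n;

	if (len == 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0) {
		n = recv(fd, out, len - 1, 0);
		if (n > 0) {
			out[n] = '\0';
			return 0;
		}
	}

	strncpy(out, "UNKNOWN", len - 1);
	out[len - 1] = '\0';
	return -1;
}

/* Walks three simulated instances and counts the ones that did not answer. */
static int count_stuck(int timeout_ms)
{
	const char *replies[] = { "RUNNING", NULL, "STOPPED" };
	size_t i, n = sizeof(replies) / sizeof(replies[0]);
	int stuck = 0;

	for (i = 0; i < n; i++) {
		struct fake_monitor m;
		char state[16];

		if (fake_monitor_open(&m, replies[i]) < 0)
			return -1;
		if (query_state(m.client_fd, timeout_ms, state, sizeof(state)) < 0)
			stuck++; /* skip this instance instead of blocking the whole list */
		close(m.client_fd);
		close(m.monitor_fd);
	}

	return stuck;
}
```

With the silent instance in the middle, the loop still reaches the third instance; the cost of a stuck monitor is bounded by timeout_ms rather than being unbounded.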
I would like to know your opinions about the following proposal.
So, I'll post PR with changes for lxc, go-lxc, lxd. |
yeah, I think that'd work. It wouldn't cause an API break on go-lxc as that'd just be a new function we can call when setting up the go-lxc struct. On the liblxc side, we'll need something similar so we can set the timeout there, but that should similarly just be a new symbol. As usual, we'll need to do the version detection dance in go-lxc so we can build against older liblxc and use the feature when it's available on newer versions. |
Yeah that sounds good thanks. |
The lxccontainer set_timeout method allows setting SO_RCVTIMEO for the LXC client socket. This commit doesn't change behavior, because it just adds a new option and setter and does not change the implementation of any existing LXC commands. It also extends the internal API function lxc_cmd with lxc_cmd_timeout. Issue lxc#4257 Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Issue lxc#4257 Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
The lxccontainer set_timeout method allows setting the LXC client timeout for waiting on a monitor response. Right now, it's implemented using the SO_RCVTIMEO client socket option. (But that's an implementation detail that can be changed in the future.) This commit doesn't change behavior, because it just adds a new option and setter and does not change the implementation of any existing LXC commands. It also extends the internal API function lxc_cmd with lxc_cmd_timeout. Issue lxc#4257 Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
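For readers unfamiliar with SO_RCVTIMEO, here is a small standalone C sketch of the mechanism the commit message mentions. It is not the liblxc implementation: set_recv_timeout and recv_times_out are hypothetical helpers, and an AF_UNIX socketpair with a silent peer is assumed as a stand-in for a monitor that never replies.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not a liblxc function): bound how long recv() may block. */
static int set_recv_timeout(int fd, int timeout_ms)
{
	struct timeval tv = {
		.tv_sec = timeout_ms / 1000,
		.tv_usec = (timeout_ms % 1000) * 1000,
	};

	/* A zero timeval would disable the timeout, so callers pass timeout_ms > 0. */
	return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/*
 * Simulates a client waiting on a peer that stays connected but silent.
 * Returns true if recv() gave up with EAGAIN/EWOULDBLOCK after the timeout.
 */
static bool recv_times_out(int timeout_ms)
{
	int sv[2];
	char buf[16];
	ssize_t n;
	bool timed_out;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return false;

	if (set_recv_timeout(sv[0], timeout_ms) < 0) {
		close(sv[0]);
		close(sv[1]);
		return false;
	}

	/* sv[1] is never written to, so without the timeout this would block. */
	n = recv(sv[0], buf, sizeof(buf), 0);
	timed_out = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));

	close(sv[0]);
	close(sv[1]);
	return timed_out;
}
```

The caller sees an ordinary error return from recv() instead of hanging, which is what lets a client report "monitor not responding" and move on.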
Is anyone working on a PR for this? |
Hi Serge, yep. I'm working on it. |
It is done. |
Inspired by https://github.com/lxc/lxd/issues/11227
Similar to the shutdown function that already takes a timeout. This way, if a container's monitor isn't responding, LXD can time out getting its state without leaking a goroutine.