Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a state liblxc function that has a timeout #4257

Closed
tomponline opened this issue Jan 11, 2023 · 20 comments
Closed

Provide a state liblxc function that has a timeout #4257

tomponline opened this issue Jan 11, 2023 · 20 comments
Assignees

Comments

@tomponline
Copy link
Collaborator

Inspired by https://github.com/lxc/lxd/issues/11227

Similar to the shutdown function that already takes a timeout.

This way if a container's monitor isn't responding, LXD can timeout getting its state without leaking a Go routine.

@tomponline tomponline changed the title Provide an option for a timeout to the state liblxc function Provide a timeout option to the state liblxc function Jan 11, 2023
@tomponline tomponline changed the title Provide a timeout option to the state liblxc function Provide a state liblxc function that has a timeout Jan 11, 2023
@mihalicyn
Copy link
Member

mihalicyn commented Jan 13, 2023

It's pretty strange that we are facing this issue.

This is the lxcapi handler for shutdown (with timeout support):


static bool do_lxcapi_shutdown(struct lxc_container *c, int timeout)
{
// ... definitions and so on

	if (!c)
		return false;

	if (!do_lxcapi_is_running(c)) // <<=== IMPORTANT CALL
		return true;

Then,

static bool is_stopped(struct lxc_container *c)
{
	lxc_state_t s;

	s = lxc_getstate(c->name, c->config_path);
	return (s == STOPPED);
}

static bool do_lxcapi_is_running(struct lxc_container *c)
{
	if (!c)
		return false;

	return !is_stopped(c);
}

So, do_lxcapi_is_running just wraps lxc_getstate.

Now let's look at our liblxc state handler:

static const char *do_lxcapi_state(struct lxc_container *c)
{
	lxc_state_t s;

	if (!c)
		return NULL;

	s = lxc_getstate(c->name, c->config_path);
	return lxc_state2str(s);
}

It means that if the state API call freezes, then shutdown will freeze too. Am I correct?

All stuff around timeout in shutdown is about waiting for the container "stopped state" to be reached. I think we need to investigate this particular user problem case more in-depth.

@tomponline
Copy link
Collaborator Author

tomponline commented Jan 13, 2023

@mihalicyn I've been investigating something which may (or may not) be related over in this PR:

https://github.com/lxc/lxd/pull/11278

In our LXD cluster evacuation tests we create some containers and then initiate a clean shutdown with a timeout of 5s (passed to liblxc via go-lxc), if the container doesn't response/shutdown in time (which is normal as we aren't waiting for it to fully start up so might not catch the signal) then LXD performs a forceful stop using go-lxc.

It seems that sometimes the forceful stop call either never returns or returns after a very long time (>30s).

The failure scenario we are seeing looks like this:
https://jenkins.linuxcontainers.org/job/lxd-github-pull-test/8609/arch=amd64,backend=zfs,compiler=golang-1.18,section=cluster/console#console-section-1

Because LXD gives up waiting for liblxc to return (and likely leaks a go routine because of it).

In PR https://github.com/lxc/lxd/pull/11278 if I perform a forceful stop first, and don't attempt a clean shutdown first, then it completes fine:
https://jenkins.linuxcontainers.org/job/lxd-github-pull-test/8602/

So looks like something is happening inside liblxc that hangs on forceful stop if a clean stop has timedout.

@mihalicyn
Copy link
Member

mihalicyn commented Jan 13, 2023

Ah, that makes the picture clearer! Thanks, Tom!

As far as I understand lxc/incus@dad047a
you've faced a situation when the shutdown API handle (even with timeout) failed to finish in the desired time, correct?

I just want to say, that we should explore this more in detail then, before adding some timeout support to the state handle. Because I can't imagine where to put this timeout to such a simple operation like getting container state...

@tomponline
Copy link
Collaborator Author

No, not quite, the inst.Shutdown(time.Duration(val) * time.Second) timesout in the duration specified, and returns an error.
But then LXD calls inst.Stop(false) and this calls into go-lxc c.Stop() function, which then hangs or takes a long time.
Calling just inst.Stop(false) without calling inst.Shutdown() (and returning an error) before, succeeds fine.

Because of the way LXD sets up LXC containers, we would expect liblxc to call our stop hooks (by way of an API call when the container has stopped). But I added some logging to LXD and I often didn't see that call arrive at LXD for >30s when the problems occurred.

But this is intermittent.

@tomponline
Copy link
Collaborator Author

I added logging to check it was a LXD bug, and I could see it hanging on the call to go-lxc's c.Stop() function.

@tomponline
Copy link
Collaborator Author

I only see this in the jenkins test runners, but if they are running a problematic kernel, then perhaps its to do with the same problem.

@mihalicyn
Copy link
Member

mihalicyn commented Jan 13, 2023

Do we have any evidence that state API call takes a long time to finish? Is it possible that something else makes this problem with the lxc list? As I mentioned above if state freeze, then shutdown (and stop) will freeze too. I just want to say that state API call is the most "light" API call that we can imagine. If it freezes then almost everything will freeze too.

@mihalicyn
Copy link
Member

mihalicyn commented Jan 13, 2023

The only place where we can wait too long is lxc_cmd_rsp_recv which is called from lxc_cmd: https://github.com/lxc/lxc/blob/master/src/lxc/commands.c#L168

Fun fact is that from the container daemon side we have ultra-light handler for the state command:

static int lxc_cmd_get_state_callback(int fd, struct lxc_cmd_req *req,
				      struct lxc_handler *handler,
				      struct lxc_async_descr *descr)
{
	struct lxc_cmd_rsp rsp = {
		.data = INT_TO_PTR(handler->state),
	};

	return lxc_cmd_rsp_send_reap(fd, &rsp);
}

It just can't freeze :) => the problem is somewhere in lower levels of the commands processing daemon. Possibly daemon is already executing some other command which is stuck somewhere. I'll try to reproduce such a situation locally with some hacks.
But I want to say that if some of our users experience this it would be useful to have stack traces for the container daemon process.

This also can be related to io_uring, cause the command processing daemon is using io_uring.
#4256

@tomponline
Copy link
Collaborator Author

@stgraber do you know what kernel the jenkins PR test runners user?

@stgraber
Copy link
Member

Ubuntu 5.15.0, usually 5.15.0-52 apparently

@mihalicyn
Copy link
Member

mihalicyn commented Jan 13, 2023

Might be useful https://github.com/lxc/lxd/issues/11228#issuecomment-1382037545

Btw, Ubuntu-5.15.0-52.58 contains only 4 fixups for this aa43477b04025 ("io_uring: poll rework") patch.

@mihalicyn
Copy link
Member

mihalicyn commented Jan 13, 2023

What we can do from our side, is to add some timeout to lxc_cmd_rsp_recv_fds to unblock the lxc_cmd call if something is wrong with the container daemon. But it's a global thing. Of course, we can add some conditions to apply this timeout for waiting for a response from the daemon only for the LXC_CMD_GET_STATE command. But I'm not sure that it's a fancy solution.
BTW, the concept of timeout in shutdown is totally different, because there (in shutdown) we are applying timeout on the container daemon side while processing a user command request, but here we want to apply timeout on waiting response. It means that after reaching timeout we'll forget about our command request, but the daemon will continue the execution...

@stgraber @brauner what do you think about adding such a timeout?

@stgraber
Copy link
Member

Sounds good to me. In general the monitor commands are all supposed to be quick, if that's not the case, there's something wrong going on and it's probably best to not remain hung at that stage.

@ewildgoose
Copy link

Hi, original reporter here. I might have accidentally confused the issue by referring to "shutdown". In fact I think the final result turned out to be that the "status" was blocking and this in turn led to some other functions getting stuck and blocked, including shutdown

So after more research, it seems like "lxc console", when using io_uring, will eventually (usually only a few seconds) get into a state where it's "stuck" (sorry, vague term, I mean this from a user perspective, without reference to the actual code). Whatever has happened, will then also cause the "lxc list" command to block and never return data to the end user

(Hence the confusion, when I originally experienced this, it was trying to shutdown an instance which was misbehaving, however, the deeper issue was clearly something like the status function not returning)

I accept that this is trying to solve for an undefined state that shouldn't ever appear! However, that it had happened to me, it was difficult to escape from, because I lacked information on which instance had "gone rogue", so I was rather in the dark trying to find which instance had stopped functioning. I suspect that there isn't a single point to fix this, since it should never happen (basically io_uring calls not working as advertised), but fixing key higher level functions to survive such a state seems helpful? Meaning, the ability for "shutdown" to be able to force kill a misbehaving instance, and the "lxc list" function surviving and ignoring the stuck instance, would both be very helpful.

Note, I think all the above is known and you are in agreement with it. Just wanted to explain how the initial report evolved from "cannot shutdown" to understanding that it's an io_uring issue which is affecting internal features such as used by "lxc list"

@mihalicyn
Copy link
Member

I would like to know your opinions about the following proposal.

  1. we are extending the go-lxc container API with the following method
// SetTimeout sets the response receive timeout for commands
func (c *Container) SetTimeout(timeout time.Duration)
  1. on the LXC side we are just setting SO_RCVTIMEO on container command socket in lxc_cmd_connect

    int lxc_cmd_connect(const char *name, const char *lxcpath,

  2. commands, which support timeout (on the daemon side) already like shutdown / wait will not be changed, but we also will apply client-side timeout on them too.

So, I'll post PR with changes for lxc, go-lxc, lxd.

cc @tomponline @stgraber

@stgraber
Copy link
Member

yeah, I think that'd work. It wouldn't cause an API break on go-lxc as that'd just be a new function we can call when setting up the go-lxc struct. On the liblxc side, we'll need something similar so we can set the timeout there, but that should similarly just be a new symbol.

As usual, we'll need to do the version detection dance in go-lxc so we can build against older liblxc and use the feature when it's available on newer versions.

@tomponline
Copy link
Collaborator Author

Yeah that sounds good thanks.

mihalicyn added a commit to mihalicyn/lxc that referenced this issue Jan 17, 2023
lxccontainer set_timeout method allows to set SO_RCVTIMEO
for LXC client socket.

This commit doesn't change behavior, because it's just
adds a new option and setter, but not changes any existing
LXC commands implementation. It's also extends internal API
function lxc_cmd with lxc_cmd_timeout.

Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/lxc that referenced this issue Jan 17, 2023
Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/lxc that referenced this issue Apr 18, 2023
lxccontainer set_timeout method allows to set SO_RCVTIMEO
for LXC client socket.

This commit doesn't change behavior, because it's just
adds a new option and setter, but not changes any existing
LXC commands implementation. It's also extends internal API
function lxc_cmd with lxc_cmd_timeout.

Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/lxc that referenced this issue Apr 18, 2023
Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/lxc that referenced this issue Apr 18, 2023
lxccontainer set_timeout method allows to set LXC client
timeout for waiting monitor response.

Right now, it's implemented using the SO_RCVTIMEO client
socket option. (But it's the implementation detail that
can be changed in the future.)

This commit doesn't change behavior, because it's just
adds a new option and setter, but not changes any existing
LXC commands implementation. It's also extends internal API
function lxc_cmd with lxc_cmd_timeout.

Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/lxc that referenced this issue Apr 18, 2023
Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/lxc that referenced this issue Apr 19, 2023
Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/lxc that referenced this issue Apr 19, 2023
Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/lxc that referenced this issue Apr 21, 2023
lxccontainer set_timeout method allows to set LXC client
timeout for waiting monitor response.

Right now, it's implemented using the SO_RCVTIMEO client
socket option. (But it's the implementation detail that
can be changed in the future.)

This commit doesn't change behavior, because it's just
adds a new option and setter, but not changes any existing
LXC commands implementation. It's also extends internal API
function lxc_cmd with lxc_cmd_timeout.

Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
mihalicyn added a commit to mihalicyn/lxc that referenced this issue Apr 21, 2023
Issue lxc#4257

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
@hallyn
Copy link
Member

hallyn commented Aug 28, 2023

Is anyone working on a PR for this?

@mihalicyn
Copy link
Member

Hi Serge,

yep. I'm working on it.
#4260 was merged.
lxc/go-lxc#166 is on the way

@mihalicyn
Copy link
Member

It is done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

5 participants