timeouts and cleanup #969

Closed · pohly opened this issue Apr 22, 2022 · 31 comments

pohly (Contributor) commented Apr 22, 2022

In Kubernetes, we would like to have:

  • an overall timeout for the entire suite (-ginkgo.timeout, with a very high value because the suite is large)
  • a per spec timeout (???, smaller, to avoid blocking for the entire -ginkgo.timeout duration on a single spec)
  • timeouts for AfterEach specs (we try to clean up, but cleaning up itself may get stuck)

So far I have only found -ginkgo.timeout. One downside of it is that it also aborts cleanup operations as soon as those block, which is not useful for Kubernetes because the cleanup operation may involve communication with the apiserver to remove objects.

I tried with this:

package e2e

import (
	"fmt"
	"time"

	"github.com/onsi/ginkgo/v2"
)

var _ = ginkgo.Describe("test", func() {
	ginkgo.BeforeEach(func() {
		fmt.Fprint(ginkgo.GinkgoWriter, "hello\n")
	})

	ginkgo.AfterEach(func() {
		defer fmt.Fprint(ginkgo.GinkgoWriter, "done\n")
		fmt.Fprint(ginkgo.GinkgoWriter, "world\n")
		time.Sleep(time.Hour)
	})

	ginkgo.It("times out", func() {
		time.Sleep(time.Hour)
	})
})

When I run with -ginkgo.timeout=10s -ginkgo.v -ginkgo.progress, I get:

test
  times out
  /nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/e2e.go:177
[BeforeEach] test
  /nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/e2e.go:167
hello
[It] times out
  /nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/e2e.go:177
[AfterEach] test
  /nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/e2e.go:171
world
------------------------------
•! [INTERRUPTED] [10.933 seconds]
test
/nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/e2e.go:166
  [It] times out
  /nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/e2e.go:177

  Begin Captured GinkgoWriter Output >>
    [BeforeEach] test
      /nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/e2e.go:167
    hello
    [It] times out
      /nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/e2e.go:177
    [AfterEach] test
      /nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/e2e.go:171
    world
  << End Captured GinkgoWriter Output

  Interrupted by Timeout

  Here's a stack trace of all running goroutines:
  ...

Note that the cleanup spec didn't run its defer.

We could provide per-spec timeouts via ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute). We then need to be careful that the cleanup spec doesn't use the same context because it wouldn't get any work done after a timeout. The downside of this is that we would have to touch a lot of code in Kubernetes, which is always a daunting prospect.
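A minimal sketch of that idea, using the same imports as the example above (doAPICall and deleteObjects are hypothetical helpers that honor context cancellation):

var _ = ginkgo.Describe("per-spec timeout", func() {
	var specCtx context.Context
	var cancel context.CancelFunc

	ginkgo.BeforeEach(func() {
		specCtx, cancel = context.WithTimeout(context.Background(), 10*time.Minute)
	})

	ginkgo.AfterEach(func() {
		cancel()
		// Fresh context so cleanup still gets work done after a spec timeout.
		cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 5*time.Minute)
		defer cleanupCancel()
		deleteObjects(cleanupCtx) // hypothetical cleanup helper
	})

	ginkgo.It("uses the spec context", func() {
		doAPICall(specCtx) // hypothetical; returns early once specCtx expires
	})
})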

IMHO it would be simpler to have a default -ginkgo.it-timeout (for It specs), -ginkgo.after-timeout (for AfterEach) and perhaps a Timeout(time.Duration) decorator to override those defaults.

onsi (Owner) commented Apr 26, 2022

hey there - lots to dig into here. First the behavior of -ginkgo.timeout:

One downside of it is that it also aborts cleanup operations as soon as those block, which is not useful for Kubernetes because the cleanup operation may involve communication with the apiserver to remove objects.

When a timeout happens Ginkgo stops execution of the current spec (or specs, if running in parallel) and then starts to unwind the suite and clean up. It runs through each relevant After* node to do that. There is, however, a chance that a given After* node will hang so Ginkgo needs to keep an eye on things and make sure that the After* nodes finish eventually. It does this, under the hood, by repeatedly sending a signal to a channel. Whenever that signal is received any currently running After* node is aborted and the next node is picked up. Currently the math to figure out the duration between signals sent on that channel is here.

As you can see there are some hard-coded values in there. I'd be happy to expose those values as command-line flags so you can configure them. Doing so would allow you to give your suite an arbitrarily long time to run the After* nodes. Let me know if you'd like me to do that.


Ginkgo used to have support for per-spec timeouts but I've removed that support. The link goes into some of the why but one of the main reasons is because timed-out specs continue to run in the background. If/when they eventually fail the failure can get erroneously registered with the currently running spec and lead to some hella confusing output that can be hard to debug. (This is all a consequence of the design decision, made long ago, to give users a DSL where you don't have to pass anything into each Ginkgo node. The resulting reliance on global state has been... ok but one of the biggest pain points is this edge case: I simply have no way (without hacking at goroutine ids) to associate a failure with a given goroutine).

But I actually have come to the point of view that per-spec timeouts really aren't the right long-term solution anyway. Rather, it's better to rely on Gomega's async assertions to pepper appropriate timeouts on calls that could potentially time out. This gives you the context you need when a timeout failure occurs ("timed out waiting for such-and-such api call to happen" vs "this spec timed out. somewhere.") and it cleans up the spec nicely because subsequent operations in the spec do not run (except for the cleanup After* nodes).
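For illustration, such an Eventually-based wait with an explicit timeout might look like this (getPodPhase is a hypothetical helper; the durations are arbitrary):

gomega.Eventually(func() (string, error) {
	// getPodPhase is a hypothetical helper that queries the apiserver.
	return getPodPhase("default", "my-pod")
}).WithTimeout(2 * time.Minute).WithPolling(5 * time.Second).
	Should(gomega.Equal("Running"), "timed out waiting for pod to reach Running")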

I think y'all are using Eventually already, correct?

Note that if you do want to provide spec-level timeouts that is possible via the migration strategy proposed in the Migration doc. So there's nothing preventing you from implementing a class of specs governed by timeouts. However I do recommend the more granular approach described above.

pohly (Contributor, Author) commented Apr 26, 2022

There is, however, a chance that a given After* node will hang so Ginkgo needs to keep an eye on things and make sure that the After* nodes finish eventually. It does this, under the hood, by repeatedly sending a signal to a channel. Whenever that signal is received any currently running After* node is aborted and the next node is picked up.

How does it abort a running spec, i.e. which mechanism is used to force a Go function to return? Just curious.

As you can see there are some hard-coded values in there. I'd be happy to expose those values as command-line flags so you can configure them. Doing so would allow you to give your suite an arbitrarily long time to run the After* nodes. Let me know if you'd like me to do that.

I think that would be useful. Exposing these as options would also be an opportunity to document this behavior in a user-visible place. At least I didn't see anything about this aspect.

But I actually have come to the point of view that per-spec timeouts really aren't the right long-term solution anyway. Rather, it's better to rely on Gomega's async assertions to pepper appropriate timeouts on calls that could potentially time out.

I tend to agree, and as I said somewhere (not sure anymore where), Kubernetes should have timeouts defined pretty much everywhere, so this works for us.

I think y'all are using Eventually already, correct?

I wish we did, but unfortunately not.

pohly (Contributor, Author) commented Apr 26, 2022

I tend to agree, and as I said somewhere (not sure anymore where), Kubernetes should have timeouts defined pretty much everywhere, so this works for us.

However, those are usually local timeouts, not per-spec timeouts. So if a spec creates 10 pods, each pod creation gets the same 5 minute timeout instead of having an overall timeout for the spec. I've seen cases where something was wrong, but not wrong enough to time out any of the individual steps, so the spec kept running although it would have been better to abort after a certain per-spec timeout.

pohly (Contributor, Author) commented Apr 26, 2022

It's also hard to review how long the entire spec is expected to run. One would have to sum up the individual timeouts, of which most are hidden behind helper functions.

onsi (Owner) commented Apr 26, 2022

How does it abort a running spec, i.e. which mechanism is used to force a Go function to return? Just curious.

I wish I could forcibly end the go function. But that isn't possible. It just moves on to other things (so you actually potentially end up with multiple things running in the background when a timeout occurs and is unwinding)

I think that would be useful. Exposing this as options then also is an opportunity to document this behavior in a user-visible place. At least I didn't see anything about this aspect.

OK sounds good. I'll work on it and will try to ship something soon.

It's also hard to review how long the entire spec is expected to run. One would have to sum up the individual timeouts, of which most are hidden behind helper functions.

that's also part of the challenge - what should you set the spec-level timeout to? In any event, as I said, y'all could implement spec-level timeouts on a case-by-case basis using the pattern in the migration doc.

pohly (Contributor, Author) commented Apr 26, 2022

I wish I could forcibly end the go function. But that isn't possible.

That's what I thought.

It just moves on to other things (so you actually potentially end up with multiple things running in the background when a timeout occurs and is unwinding)

So you are saying that when something times out, the code keeps running in the background and ginkgo just "pretends" that the It or AfterEach has stopped?

Then why does my example not print "done" via defer fmt.Fprint(ginkgo.GinkgoWriter, "done\n")?

onsi (Owner) commented Apr 26, 2022

your example doesn't print "done" because the test process exits after the last AfterEach node is "terminated"

does that make sense? in Go one goroutine can't cause a different goroutine to exit. so Ginkgo has to let these timed-out goroutines "leak" - but since a timeout has occurred Ginkgo's goal is to try to run as many of the cleanup nodes as possible and then exit. so when the last clean up node times out there aren't any others to run and Ginkgo invokes os.Exit()

pohly (Contributor, Author) commented Apr 27, 2022

That makes sense.

However, I am now worried about the following situation:

  • An It node is running when the suite times out. It keeps running because ginkgo cannot kill the goroutine.
  • ginkgo starts the AfterEach node in parallel. The AfterEach code uses the same variables as the It code, so they now conflict.

Can this happen? If not, how does ginkgo prevent it?

My assumption was that ginkgo raises a panic when it gets the chance (for example, because By is called in a long-running spec) to exit the It function. I haven't checked whether that assumption is true and now that I think about it some more I see problems with such an approach (By then might also be called by the AfterEach code, in which case it should not panic, or at least not yet).

onsi (Owner) commented Apr 27, 2022

hey @pohly - yes you have it right.

I think of suite timeouts as exceptional situations (similar to user-initiated interrupts like sending a SIGINT). The intent is to say "the suite has failed - do your best to clean up and exit ASAP." Ginkgo then does a best-effort cleanup that is predicated on the assumption that everything is shutting down ASAP so let's tear things down.

So, yes, the It may still be running while the After* are running. In practice though that It is either stuck (which is why the timeout/interrupt happened) or will fail naturally as the After* clean up from under it (which is... fine, since the whole suite is suspect/failed and our goal is to exit as cleanly as reasonably possible, ASAP). In general I haven't seen too many issues arise from this approach - the cleanups usually get pretty far in succeeding (especially if they are well-designed) and, absent a mechanism in Go to kill a sibling goroutine there really are no pragmatic alternatives.

If you're observing or expecting a specific concern based on this behavior let me know.

pohly (Contributor, Author) commented Apr 27, 2022

So, yes, the It may still be running while the After* are running.

It might be worth calling that out as a caveat in the documentation, if it's not there already. I might have missed it.

In practice though that It is either stuck (which is why the timeout/interrupt happened)

Not in my experience. The E2E suite in Kubernetes can sometimes run much more slowly than usual, without any particular test hanging. Then when the timeout occurs in a CI job, some random spec is still active and running.

When debugging interactively, I tend to use CTRL-C to abort when I see that I invoked the suite incorrectly or I am not interested in further results, which also happens to be in the middle of some healthy spec. If I then see that some AfterEach is hanging, I might try CTRL-C again to indicate that now I really want to exit ASAP.

My expectation in both cases is that ginkgo shuts down cleanly (assuming that AfterEach doesn't hang), without relying on winning race conditions.

the cleanups usually get pretty far in succeeding (especially if they are well-designed)

We track resources that have to be cleaned up, so calling AfterEach after an aborted It should usually be fine. But our code is not prepared to run concurrently.

absent a mechanism in Go to kill a sibling goroutine there really are no pragmatic alternatives.

I can think of several solutions:

  • raise a panic each time Ginkgo is entered while timed out (By, output to GinkgoWriter), but then stop doing that once It has returned and AfterEach are about to run
  • set up a context with context.WithCancel and grant specs access to that, then when timed out, cancel that context

The second approach relies on specs being written to use that context instead of context.Background() or context.TODO(). It has the advantage that additional goroutines that may have been spawned also get cancelled, assuming that they are controlled by a context or the stop channel associated with that context. At least in Kubernetes, that is a common pattern. Obviously code will have to be adapted, but this should be doable.
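To illustrate the advantage mentioned for the second approach, here is a rough user-level sketch of a spec whose spawned goroutine stops on cancellation (suiteCtx is an assumed suite-wide cancellable context; watchEvents is a hypothetical helper returning a channel):

ginkgo.It("watches events until cancelled", func() {
	ctx, cancel := context.WithCancel(suiteCtx)
	defer cancel()
	events := watchEvents(ctx) // hypothetical helper returning a channel
	go func() {
		for {
			select {
			case <-ctx.Done():
				return // spawned goroutines also stop once the context is cancelled
			case ev, ok := <-events:
				if !ok {
					return
				}
				fmt.Fprintf(ginkgo.GinkgoWriter, "event: %v\n", ev)
			}
		}
	}()
	// ... test steps that pass ctx into API calls ...
})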

onsi (Owner) commented Apr 27, 2022

It might be worth calling that out as a caveat in the documentation, if it's not there already. I might have missed it.

I'm not sure if it's there but I'll double check and make sure to add it.

raise a panic each time Ginkgo is entered while timed out (By, output to GinkgoWriter), but then stop doing that once It has returned and AfterEach are about to run

this actually isn't possible - there's no way for By/GinkgoWriter/Gomega to know whether or not they are running in an It or in an AfterEach.

set up a context with context.WithCancel and grant specs access to that, then when timed out, cancel that context

today there isn't a great way for Ginkgo to inject information into specs. I could imagine a GinkgoContext() that specs could ask for that is wired up with the correct timeout. FWIW users can basically implement this themselves today, by creating a Context and setting a deadline on it.

I'm open to doing that and will add it to the backlog - but I can't give you a confident release date yet.

While a cancellable context does help it doesn't necessarily guarantee race-free clean up. Ginkgo would cancel the context and then immediately start running the After* nodes to clean up. There will be a window in there where your spec is still running contemporaneously with the clean up code.

pohly (Contributor, Author) commented Apr 27, 2022

this actually isn't possible - there's no way for By/GinkgoWriter/Gomega to know whether or not they are running in an It or in an AfterEach

Isn't it so that there's always only one active node in a process? At least internally Ginkgo should be able to identify that node and then panic only when the active node is an It.

FWIW users can basically implement this themselves today, by creating a Context and setting a deadline on it.

They could, if there was an API to determine the time that a spec still has left. I don't think there is one at the moment. The advantage of getting a context from Ginkgo is that it can be integrated with signal handling: Ginkgo can cancel that context when it receives a signal or when the timeout is reached, whatever happens first.

While a cancellable context does help it doesn't necessarily guarantee race-free clean up.

That's true. Ginkgo would also have to wait for the It node to return. If it doesn't return by the time that the suite timeout is reached, ginkgo has to exit the process without cleaning up.

Basically the promise of an It spec becomes "I will return quickly when I have to" and ginkgo promises "I will not run nodes concurrently".

onsi (Owner) commented Apr 27, 2022

Isn't it so that there's always only one active node in a process?

Yes, except in this one context where an interrupt/timeout has occurred and Ginkgo is cleaning up as quickly as possible.

They could, if there was an API to determine the time that a spec still has left. I don't think there is one at the moment. The advantage of getting a context from Ginkgo is that it can be integrated with signal handling: Ginkgo can cancel that context when it receives a signal or when the timeout is reached, whatever happens first.

yep, a GinkgoContext() would make sense. You can get the total time available to a suite at the beginning (in a BeforeSuite for example) by reading the GinkgoConfiguration(). You could then have a parent Context that sets that timeout and gets distributed to all the specs.
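A sketch of what that could look like today — GinkgoConfiguration() and DeferCleanup are real Ginkgo v2 APIs, but the suiteCtx wiring is our assumption, not an existing feature:

package e2e

import (
	"context"

	"github.com/onsi/ginkgo/v2"
)

var suiteCtx context.Context

var _ = ginkgo.BeforeSuite(func() {
	suiteConfig, _ := ginkgo.GinkgoConfiguration()
	var cancel context.CancelFunc
	if suiteConfig.Timeout > 0 {
		// Give every spec the suite's total time budget.
		suiteCtx, cancel = context.WithTimeout(context.Background(), suiteConfig.Timeout)
	} else {
		suiteCtx, cancel = context.WithCancel(context.Background())
	}
	ginkgo.DeferCleanup(cancel) // runs when the suite tears down
})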

If it doesn't return by the time that the suite timeout is reached, ginkgo has to exit the process without cleaning up.

IMO the cost of explicitly not cleaning up if a spec is misbehaving/stuck seems higher to me than the cost of cleaning up concurrently. I'm not planning on changing the current behavior around cleaning up.

pohly (Contributor, Author) commented Apr 27, 2022

IMO the cost of explicitly not cleaning up if a spec is misbehaving/stuck seems higher to me than the cost of cleaning up concurrently. I'm not planning on changing the current behavior around cleaning up.

That is very subjective. My own preference is the other way around, at least for the suites that I work on. But I can also see why for other suites the current behavior may be more suitable. Perhaps we could do a poll to find out which mode users would pick?

If both approaches have users, we could add a command-line switch that prevents concurrent cleanup and then do things like "panic during It" and "wait for It termination".

aojea commented Apr 29, 2022

Timeouts on tests are a difficult topic; see the related golang issue about "per-test timeouts": golang/go#48157.

It seems, per Onsi's comments, that there are some implementation details that are difficult to overcome, but maybe adding the necessary tooling to overcome these problems is the solution here. Will this GinkgoContext() allow users to implement graceful shutdowns and per-test timeouts?

pohly (Contributor, Author) commented Apr 29, 2022

Will this GinkgoContext() allow users to implement graceful shutdowns and per-test timeouts?

I think it would help those suite developers who don't want to run cleanup code in parallel to a running test. Suite developers who are okay with potential race conditions during cleanup don't need it (current approach).

pohly (Contributor, Author) commented Apr 29, 2022

@aojea: you know the Kubernetes E2E suite. What's your opinion, would it make sense for us to run AfterEach in parallel to an It that has timed out but keeps running? The code is not prepared for this kind of concurrency, which is my main concern.

aojea commented Apr 29, 2022

@aojea: you know the Kubernetes E2E suite. What's your opinion, would it make sense for us to run AfterEach in parallel to an It that has timed out but keeps running? The code is not prepared for this kind of concurrency, which is my main concern.

The cleanup in the E2E suite depends on Kubernetes "namespaces": the framework creates a new namespace for each test, runs the test, and in AfterEach deletes the namespace.

The garbage collector should then kick in and try to delete everything in the namespace asynchronously. IIRC it doesn't "force" deletion, so if the It keeps running it will not be able to clean up. However, if the It is destroyed I think the garbage collector should be able to proceed with the deletion; the key is that we are still able to send the "namespace deletion" ... but I was thinking this was solved by kubernetes/kubernetes#20771 #222

The second approach relies on specs being written to use that context instead of context.Background() or context.TODO(). It has the advantage that additional goroutines that may have been spawned also get cancelled, assuming that they are controlled by a context or the stop channel associated with that context. At least in Kubernetes, that is a common pattern. Obviously code will have to be adapted, but this should be doable.

However, I think that "contexts" are the canonical way to solve these problems in golang, then if an It hangs is because the test is not well written because is not using the "context" correctly

onsi (Owner) commented Apr 29, 2022

I have an idea for a design that would allow users to have a per-It timeout as well.

Contexts came on the scene after Ginkgo first went GA which is one reason why there isn't better native support for them.

pohly (Contributor, Author) commented Apr 29, 2022

The cleanup on the E2E depends on Kubernetes "namespaces", the framework creates a new namespace for each test, runs the test and in AfterEach it deletes the namespace.

Not all resources are necessarily tracked via the test's namespace and/or there are cases where deleting the namespace is not enough to clean up. For example, PVCs are created in the namespace. But they get provisioned and deleted by a CSI driver, so if cleanup deletes the namespace and then immediately also the CSI driver, the volume leaks because Kubernetes cannot remove the PVC and thus the namespace.

TLDR version: it's complicated...

aojea commented Apr 29, 2022

TLDR version: it's complicated...

... and not only a problem for tests 🙃

pohly (Contributor, Author) commented Jul 8, 2022

I have an idea for a design that would allow users to have a per-It timeout as well.

Now that Kubernetes has switched to Ginkgo v2, I'd like to continue investigating how we can improve the timeout handling.

@onsi: Can you share what this design would be? Need any help?

onsi (Owner) commented Aug 24, 2022

So a quick sketch of what I'm imagining. I'll need to flesh this out and really think through the various edge cases. But, for starters, I could imagine that any node (i.e. It, BeforeEach, etc.) can opt into timeout behavior by passing in a function that accepts a GinkgoContext and an optional NodeTimeout:

It(func(gc GinkgoContext) {
    //do stuff, check for closure of `gc.Done()` etc.
}, NodeTimeout(10*time.Second))

(Note that there is not an overall spec timeout. Just timeouts for each individual node. I could imagine adding a SpecTimeout() decorator that only goes on It nodes if need be but let's not get ahead of ourselves.)

Now if the node timeout elapses or if an interrupt signal is received Ginkgo would Cancel() the context (or configure the Deadline() in the case of a timeout). The tricky bit is... what happens next?

Today Ginkgo's behavior is to mark the node as failed and immediately start running the various clean up nodes. Since Go does not allow one goroutine to forcibly kill another the node that is stuck continues running in the background while the clean up nodes run. My understanding is that y'all are concerned that these nodes running in parallel will be problematic.

So - we could say that the contract with a node that receives a GinkgoContext is that Ginkgo will wait until that node exits before it will run any clean up nodes. That solves the problem of having multiple nodes running in parallel.

But... what if the node is truly stuck or poorly written (i.e. someone missed checking gc.Done() somewhere) and never exits? Or needs another 5 minutes before it will exit? What do we do then? Should Ginkgo wait forever? Even if multiple interrupts are sent? If another interrupt is sent should it simply exit and avoid running any clean up nodes? And if so, which ones? All of them? Even the SynchronizedAfterSuite node that cleans up external resources that the specs relied on?

My sense is if I implemented an approach where Ginkgo hangs forever waiting for the node to exit I'll quickly get issues asking me to change the behavior. So this is something that I expect we need to solve up front. I'd love to hear what sorts of solutions folks have in mind... as I think on it I keep coming back to a version of where we are today (namely: eventually one ends up in a place where nodes simply have to run in parallel to each other) but with more complexity.

pohly (Contributor, Author) commented Aug 25, 2022

So a quick sketch of what I'm imagining. I'll need to flesh this out and really think through the various edge cases. But, for starters, I could imagine that any node (i.e. It, BeforeEach, etc.) can opt into timeout behavior by passing in a function that accepts a GinkgoContext and an optional NodeTimeout:

It(func(gc GinkgoContext) {
    //do stuff, check for closure of `gc.Done()` etc.
}, NodeTimeout(10*time.Second))

Can we make the parameter a normal context.Context instead of GinkgoContext? Then it can be passed into other functions (for example, most of client-go) and those will do the right thing automatically (= abort) when the context gets cancelled.
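Mirroring onsi's sketch above, the variant being requested might look like this (clientset is an assumed client-go clientset; none of this is an existing Ginkgo API at this point in the thread):

It(func(ctx context.Context) {
	// ctx can be handed straight to client-go:
	_, err := clientset.CoreV1().Pods("default").Get(ctx, "my-pod", metav1.GetOptions{})
	Expect(err).NotTo(HaveOccurred())
}, NodeTimeout(10*time.Second))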

(Note that there is not an overall spec timeout. Just timeouts for each individual node. I could imagine adding a SpecTimeout() decorator that only goes on It nodes if need be but let's not get ahead of ourselves.)

I think individual nodes should be fine.

So - we could say that the contract with a node that receives a GinkgoContext is that Ginkgo will wait until that node exits before it will run any clean up nodes. That solves the problem of having multiple nodes running in parallel.

Agreed.

But... what if the node is truly stuck or poorly written (i.e. someone missed checking gc.Done() somewhere) and never exits? Or needs another 5 minutes before it will exit? What do we do then? Should Ginkgo wait forever? Even if multiple interrupts are sent? If another interrupt is sent should it simply exit and avoid running any clean up nodes? And if so, which ones? All of them? Even the SynchronizedAfterSuite node that cleans up external resources that the specs relied on?

I would do the following:

  • On a single SIGINT (= CTRL-C), enter graceful shutdown:
    • Signal the running node and wait for it indefinitely to terminate before proceeding.
    • Print a message at regular intervals saying "Still waiting for XYZ to complete, press CTRL-C once to continue with shutdown or twice to exit immediately".
    • The difference between "once" and "twice" would be how quickly SIGINT gets received. There has to be a small delay (one second?) before reacting to the first one because another one might be coming, which influences the reaction. A single one moves to the next node. A double one calls os.Exit.
  • On SIGTERM, enter emergency shutdown:
    • Signal the running node and wait for it to terminate for a configurable, by default small amount of time (10 seconds?), then proceed.
    • Run all cleanup and final reporting nodes, using the same timeout.

The SIGINT handling covers running tests interactively. The user is in control and can decide.

The SIGTERM handling covers shutdowns initiated by a CI. I believe Jenkins uses it. Not sure about Prow, but at least Kubernetes pods also get killed with SIGTERM.
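For reference, Go's standard library already supports wiring signals to a context; a user-level sketch of the proposed two-step CTRL-C behavior could build on signal.NotifyContext (runSuite is hypothetical):

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// The first SIGINT/SIGTERM cancels ctx, which starts a graceful shutdown.
	// After stop() the signals revert to their default action, so a second
	// CTRL-C kills the process immediately.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	go func() {
		<-ctx.Done()
		stop()
		fmt.Println("shutting down; press CTRL-C again to exit immediately")
	}()
	runSuite(ctx) // hypothetical: specs receive ctx and unwind on cancellation
}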

onsi (Owner) commented Aug 25, 2022

GinkgoContext would satisfy the context.Context interface, so no issues there.

Your proposal for SIGINT makes sense - but it's not far from today's behavior. Today Ctrl-C immediately starts running the next cleanup node. So if you have multiple clean up nodes, you can keep hitting Ctrl-C to get to the end. Or just wait.

For SIGTERM what you describe is similar to the behavior that Ginkgo has today if Ginkgo's timeout elapses. Each clean up node gets 30 seconds to finish (and I'd be happy to make that configurable, it isn't today).

It feels, though, like we've lost sight of the original problems that opened this issue:

So far I have only found -ginkgo.timeout. One downside of it is that it also aborts cleanup operations as soon as those block, which is not useful for Kubernetes because the cleanup operation may involve communication with the apiserver to remove objects.

That isn't the behavior of -ginkgo.timeout - rather, Ginkgo will wait for 30s for each cleanup node and then re-trigger an interrupt. It will do so up to 10 times (i.e. for up to 5 minutes) before giving up and forcibly quitting. I believe the k8s e2e suite, however, wraps everything in a separate timeout that tears everything down? Perhaps the behavior of that timeout is the problem and not Ginkgo's?

The other problem that was raised is:

However, I am now worried about the following situation:

  • An It node is running when the suite times out. It keeps running because ginkgo cannot kill the goroutine.
  • ginkgo starts the AfterEach node in parallel. The AfterEach code uses the same variables as the It code, so they now conflict.

but nothing we're discussing will prevent this in an iron-clad way. Consider, for example, the case where a timeout occurs and the node takes too long to finish, so Ginkgo must move on (it can't wait forever in the case of a timeout!)

So... I guess... I'm not sure where we're landing. My sense is the existing best-effort behavior works reasonably well for most situations and that adding additional blocking/waiting via a context.Context (which would require rewriting all the specs to use the context.Context, btw) theoretically solves things for some scenarios - but doesn't do so across the board in a consistent way.

pohly (Contributor, Author) commented Aug 25, 2022

Your proposal for SIGINT makes sense - but it's not far from today's behavior. Today Ctrl-C immediately starts running the next cleanup node. So if you have multiple clean up nodes, you can keep hitting Ctrl-C to get to the end. Or just wait.

But I don't want to immediately start the next node, not while the current one is still doing some work (concurrency and potential data races). We could make this configurable, with the default as it is now.

What is currently missing is an "abort right now". For that, one has to have another shell ready and use kill to send a SIGTERM or SIGKILL. But this is probably minor.

For SIGTERM what you describe is similar to the behavior that Ginkgo has today if Ginkgo's timeout elapses. Each clean up node gets 30 seconds to finish (and I'd be happy to make that configurable, it isn't today).

It feels, though, like we've lost sight of the original problems that opened this issue:

So far I have only found -ginkgo.timeout. One downside of it is that it also aborts cleanup operations as soon as those block, which is not useful for Kubernetes because the cleanup operation may involve communication with the apiserver to remove objects.

That isn't the behavior of -ginkgo.timeout - rather, Ginkgo will wait for 30s for each cleanup node and then re-trigger an interrupt. It will do so up to 10 times (i.e. for up to 5 minutes) before giving up and forcibly quitting.

I'm not sure anymore where the "aborts cleanup operations as soon as those block" came from - it probably was a misunderstanding because I wasn't aware that it just moves on while the cleanup is still ongoing.

I believe the k8s e2e suite, however, wraps everything in a separate timeout that tears everything down? Perhaps the behavior of that timeout is the problem and not Ginkgo's?

We have several individual timeouts for certain operations like "wait for pod to run", but we don't have any overall timeout for "this cleanup operation overall may take 5 minutes". The effect is that an AfterEach can be stuck for a long time while waiting for each of the individual operations to time out. Also, if any operation lacks a timeout, AfterEach can block forever. Having a context that gets passed around would be a more reliable and predictable way to enforce an overall upper limit for the duration.
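A sketch of that overall budget: one context caps the whole AfterEach, no matter how many individual waits run inside it (deleteNamespace and waitForPodsGone are hypothetical helpers that honor cancellation):

ginkgo.AfterEach(func() {
	// Shared budget for the entire cleanup.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	deleteNamespace(ctx, "test-ns")  // each step inherits the shared budget
	waitForPodsGone(ctx, "test-ns")  // returns early once ctx expires
})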

The other problem that was raised is:

However, I am now worried about the following situation:

  • An It node is running when the suite times out. It keeps running because ginkgo cannot kill the goroutine.
  • ginkgo starts the AfterEach node in parallel. The AfterEach code uses the same variables as the It code, so they now conflict.

but nothing we're discussing will prevent this in an iron-clad way. For example, the case where a timeout occurs and the node takes too long to finish so Ginkgo must move on (it can't wait forever in the case of a timeout!)

With my proposal, the It nodes would be written so that they cannot take too long to finish. Once their context is cancelled, they will return pretty quickly because every operation in them immediately returns with a permanent failure.

The AfterEach still can run longer, but that's a termination problem, not a concurrency problem.

So... I guess... I'm not sure where we're landing. My sense is the existing best-effort behavior works reasonably well for most situations and that adding additional blocking/waiting via a context.Context (which would require rewriting all the specs to use the context.Context, btw) theoretically solves things for some scenarios - but doesn't do so across the board in a consistent way.

I agree with the "rewriting all specs" - that's the bullet we have to bite. But once we have that, at least in Kubernetes the problem should be solved pretty well because most tests depend on a context. It's just that right now, they typically use context.Background instead of something that gets cancelled or times out.

onsi (Owner) commented Aug 25, 2022

With my proposal, the It nodes would be written so that they cannot take too long to finish. Once their context is cancelled, they will return pretty quickly because every operation in them immediately returns with a permanent failure.

But there would still need to be a backstop, correct? Let's focus on the timeout use-case (either suite level with --ginkgo.timeout or the per-node timeout). Let's say the timeout occurs and the context is cancelled. Now what? Do we wait indefinitely? Do we wait for 30 seconds? Something configurable? Cancelling the context doesn't guarantee the It will finish in a reasonable period of time (there could be a bug in the user's code or the test). So you still end up with a situation where the clean up nodes could run in parallel with the It nodes, right?

Is the idea that "that's ok if it happens rarely but it's super disruptive that it happens any time a user hits ^C or a suite timeout is hit"?

I guess a part of me is trying to understand how big an issue this actually is before I invest effort in implementing the context stuff, and to figure out how to prioritize it. To flip it around: Ginkgo has always had this concurrent cleanup behavior in it from the very beginning. But it hasn't seemed to cause any actual issues in practice. Here, though, I don't have all the context you and @aojea have - for example, is it a common and painful occurrence that hitting ^C leaves a mess behind that takes time and effort to clean up? In my experience in non-k8s contexts the AfterSuite (or whatever test harness has been implemented alongside Ginkgo) usually does a solid job of just tearing everything down and getting everything back to a clean state. And ^C or a timeout is usually happening because the It is stuck so running the clean up in parallel doesn't, in practice, have any negative effects because the It isn't actually doing anything but sitting there.

(I appreciate it sounds like I'm trying to avoid doing the work - and, perhaps, at some level I'm wary of taking this on if the benefit isn't actually commensurate to the effort. I get that contexts are the cleaner way to do this. But I'm trying to gauge if this is a theoretical concern or a practical/pragmatic one.)

pohly (Contributor, Author) commented Aug 26, 2022

But there would still need to be a backstop, correct? Let's focus on the timeout use-case (either suite level with --ginkgo.timeout or the per-node timeout). Let's say the timeout occurs and the context is cancelled. Now what? Do we wait indefinitely? Do we wait for 30 seconds? Something configurable?

It's a bit like killing a Pod: first there is SIGTERM, which can be caught and handled, then 30 seconds (configurable) later a SIGKILL, which cannot be caught. Here we don't have a SIGTERM, but continuing despite the running node comes close: it's the equivalent of "we cannot shut down cleanly".

So yes, a 30-second grace period, configurable, and then continuing sounds fine to me.

for example, is it a common and painful occurrence that hitting ^C leaves a mess behind that takes time and effort to clean up? In my experience in non-k8s contexts the AfterSuite (or whatever test harness has been implemented alongside Ginkgo)

The test harness in Kubernetes typically cannot clean up resources like volumes because it doesn't know about them. If those are in the cloud, they may cause costs until someone cleans up manually.

And ^C or a timeout is usually happening because the It is stuck so running the clean up in parallel doesn't, in practice, have any negative effects because the It isn't actually doing anything but sitting there.

Here I disagree. When I abort with ^C, it's often because I have either passed a certain test or I got a test failure that I want to investigate. The suite then will get interrupted in the middle of some other test that isn't stuck.

Same with a suite timeout: we run so many tests in Kubernetes with a fairly tight overall deadline, if some earlier tests ran more slowly, then later the suite timeout can hit a test that itself is fine and still making progress.

onsi (Owner) commented Aug 26, 2022

you're certainly correct about the ^C use case not being about stuck tests - i was overreaching and that's fair pushback.

stepping back - i can implement this and i remain open to implementing it, it's going to add complexity but it does make sense and better aligns with Go's semantics around context.

What i'm trying to convey is that (a) this will not make a strong guarantee that cleanup code will never happen in parallel with It code since we will have to have a backstop that effectively behaves how ginkgo behaves today and that (b) i'm skeptical that cleanup running in parallel with It code is actually a pressing problem for the e2e suite. You've shared a number of concerns on how it could in theory be a problem but what I'm not hearing is: "aha! we finally understand why our e2e tests leave a mess behind so frequently - let's go fix it!"

So, in effect, my concern is: i go to the effort of implementing this, and y'all go to the effort of threading GinkgoContexts everywhere - but we aren't actually making life noticeably better for e2e. Does that make sense?

pohly (Contributor, Author) commented Aug 26, 2022

(a) this will not make a strong guarantee that cleanup code will never happen in parallel with It code since we will have to have a backstop that effectively behaves how ginkgo behaves today and that

True. What I am after is clear documentation to avoid surprises and (for those who want it) a configuration option. Then if someone absolutely never wants to get parallel execution, they can make the backstop duration very long or perhaps (if the option allows it) disable it.

(b) i'm skeptical that cleanup running in parallel with It code is actually a pressing problem for the e2e suite.

It's true that we don't have much actual experience with it. We have various complaints from developers who notice that cleanup was skipped (so it is important to support it!) but it is not something that occurs often enough that potential races have caused problems.

However, as we are now going to design a better solution, I want to make sure that we also avoid such corner cases before rewriting our e2e code.

So, in effect, my concern is: i go to the effort of implementing this, and y'all go to the effort of threading GinkgoContexts everywhere - but we aren't actually making life noticeably better for e2e. Does that make sense?

I think it will be noticeable, but I agree that it won't be that often.

onsi (Owner) commented Nov 16, 2022

closing this out for now, @pohly feel free to reopen but i think we've made a lot of progress! thanks!
