timeouts and cleanup #969
hey there - lots to dig into here. First the behavior of
When a timeout happens Ginkgo stops execution of the current spec (or specs, if running in parallel) and then starts to unwind the suite and clean up. It runs through each relevant

As you can see there are some hard-coded values in there. I'd be happy to expose those values as command-line flags so you can configure them. Doing so would allow you to give your suite an arbitrarily long time to run the

Ginkgo used to have support for per-spec timeouts but I've removed that support. The link goes into some of the why but one of the main reasons is because timed-out specs continue to run in the background. If/when they eventually fail the failure can get erroneously registered with the currently running spec and lead to some hella confusing output that can be hard to debug. (This is all a consequence of the design decision, made long ago, to give users a DSL where you don't have to pass anything into each Ginkgo node. The resulting reliance on global state has been... ok, but one of the biggest pain points is this edge case: I simply have no way (without hacking at goroutine ids) to associate a failure with a given goroutine.)

But I actually have come to the point of view that per-spec timeouts really aren't the right long-term solution anyway. Rather it's better to rely on Gomega's async assertions to pepper appropriate timeouts on calls that could potentially time out. This gives you the context you need when a timeout failure occurs ("timed out waiting for such-and-such api call to happen" vs "this spec timed out. somewhere.") and it cleans up the spec nicely because subsequent operations in the spec do not run (except for the cleanup

I think y'all are using

Note that if you do want to provide spec-level timeouts that is possible via the migration strategy proposed in the Migration doc. So there's nothing preventing implementing a class of specs governed by timeouts. However I do recommend the more granular approach described above.
How does it abort a running spec, i.e. which mechanism is used to force a Go function to return? Just curious.
I think that would be useful. Exposing this as options then also is an opportunity to document this behavior in a user-visible place. At least I didn't see anything about this aspect.
I tend to agree, and as I said somewhere (not sure anymore where), Kubernetes should have timeouts defined pretty much everywhere, so this works for us.
I wish we did, but unfortunately not.
However, those are usually local timeouts, not per-spec timeouts. So if a spec creates 10 pods, each pod creation gets the same 5 minute timeout instead of having an overall timeout for the spec. I've seen cases where something was wrong, but not wrong enough to time out any of the individual steps, so the spec kept running although it would have been better to abort after a certain per-spec timeout.
It's also hard to review how long the entire spec is expected to run. One would have to sum up the individual timeouts, of which most are hidden behind helper functions.
I wish I could forcibly end the Go function. But that isn't possible. Ginkgo just moves on to other things (so you actually potentially end up with multiple things running in the background when a timeout occurs and the suite is unwinding).
OK sounds good. I'll work on it and will try to ship something soon.
that's also part of the challenge - what should you set the spec-level timeout to? In any event, as I said, y'all could implement spec-level timeouts on a case-by-case basis using the pattern in the migration doc.
That's what I thought.
So you are saying that when something times out, the code keeps running in the background and ginkgo just "pretends" that the It or AfterEach has stopped? Then why does my example not print "stopped" via
your example doesn't print "stop" because the test process exits after the last

does that make sense? in Go one goroutine can't cause a different goroutine to exit. so Ginkgo has to let these timed-out goroutines "leak" - but since a timeout has occurred Ginkgo's goal is to try to run as many of the cleanup nodes as possible and then exit. so when the last clean up node times out there aren't any others to run and Ginkgo invokes
That makes sense. However, I am now worried about the following situation:
Can this happen? If not, how does ginkgo prevent it? My assumption was that ginkgo raises a panic when it gets the chance (for example, because |
hey @pohly - yes you have it right. I think of suite timeouts as exceptional situations (similar to user-initiated interrupts like sending a

So, yes, the

If you're observing or expecting a specific concern based on this behavior let me know.
It might be worth calling that out as a caveat in the documentation, if it's not there already. I might have missed it.
Not in my experience. The E2E suite in Kubernetes can sometimes run much more slowly than usual, without any particular test hanging. Then when the timeout occurs in a CI job, some random spec is still active and running. When debugging interactively, I tend to use CTRL-C to abort when I see that I invoked the suite incorrectly or I am not interested in further results, which also tends to happen in the middle of some healthy spec. If I then see that some AfterEach is hanging, I might try CTRL-C again to indicate that now I really want to exit ASAP. My expectation in both cases is that ginkgo shuts down cleanly (assuming that AfterEach doesn't hang), without relying on winning race conditions.
We track resources that have to be cleaned up, so calling
I can think of several solutions:
The second approach relies on specs being written to use that context instead of
I'm not sure if it's there but I'll double check and make sure to add it.
this actually isn't possible - there's no way for
today there isn't a great way for Ginkgo to inject information into specs. I could imagine a

I'm open to doing that and will add it to the backlog - but I can't give you a confident release date yet.

While a cancellable context does help it doesn't necessarily guarantee race-free clean up. Ginkgo would cancel the context and then immediately start running the
Isn't it so that there's always only one active node in a process? At least internally Ginkgo should be able to identify that node and then panic only when the active node is an
They could, if there was an API to determine the time that a spec still has left. I don't think there is one at the moment. The advantage of getting a context from Ginkgo is that it can be integrated with signal handling: Ginkgo can cancel that context when it receives a signal or when the timeout is reached, whatever happens first.
That's true. Ginkgo would also have to wait for the

Basically the promise of an
Yes, except in this one context where an interrupt/timeout has occurred and Ginkgo is cleaning up as quickly as possible.
yep, a
IMO the cost of explicitly not cleaning up if a spec is misbehaving/stuck seems higher than the cost of cleaning up concurrently. I'm not planning on changing the current behavior around cleaning up.
That is very subjective. My own preference is the other way around, at least for the suites that I work on. But I can also see why for other suites the current behavior may be more suitable. Perhaps we could do a poll to find out which mode users would pick? If both approaches have users, we could add a command line switch that prevents concurrent cleanup and then do things like "panic during
timeouts on tests are a difficult topic, see the related golang issue about "per-test timeouts", golang/go#48157. It seems per Onsi's comments that there are some implementation details that are difficult to overcome, but maybe adding the necessary tooling to overcome these problems is the solution here. Will this
I think it would help those suite developers who don't want to run cleanup code in parallel to a running test. Suite developers who are okay with potential race conditions during cleanup don't need it (current approach).
@aojea : you know the Kubernetes E2E suite. What's your opinion, would it make sense for us to run |
The cleanup on the E2E depends on Kubernetes "namespaces", the framework creates a new namespace for each test, runs the test and in The garbage collector then should kick in and try to delete all the things in the namespace asynchronous, IIRC it doesn't "force" so if the
However, I think that "contexts" are the canonical way to solve these problems in golang; then if an
I have an idea for a design that would allow users to have a per-It timeout as well. Contexts came on the scene after Ginkgo first went GA, which is one reason why there isn't better native support for them.
Not all resources are necessarily tracked via the test's namespace and/or there are cases where deleting the namespace is not enough to clean up. For example, PVCs are created in the namespace. But they get provisioned and deleted by a CSI driver, so if cleanup deletes the namespace and then immediately also the CSI driver, the volume leaks because Kubernetes cannot remove the PVC and thus the namespace. TLDR version: it's complicated...
... and not only a problem for tests 🙃
Now that Kubernetes has switched to Ginkgo v2, I'd like to continue investigating how we can improve the timeout handling. @onsi: Can you share what this design would be? Need any help?
So a quick sketch of what I'm imagining. I'll need to flesh this out and really think through the various edge cases. But, for starters, I could imagine that any node (i.e.

```go
It(func(gc GinkgoContext) {
	//do stuff, check for closure of `gc.Done()` etc.
}, NodeTimeout(10*time.Second))
```

(Note that there is not an overall spec timeout. Just timeouts for each individual node. I could imagine adding a

Now if the node timeout elapses or if an interrupt signal is received Ginkgo would

Today Ginkgo's behavior is to mark the node as failed and immediately start running the various clean up nodes. Since Go does not allow one goroutine to forcibly kill another, the node that is stuck continues running in the background while the clean up nodes run. My understanding is that y'all are concerned that these nodes running in parallel will be problematic.

So - we could say that the contract with a node that receives a

But... what if the node is truly stuck or poorly written (i.e. someone missed checking

My sense is if I implemented an approach where Ginkgo hangs forever waiting for the node to exit I'll quickly get issues asking me to change the behavior. So this is something that I expect we need to solve up front. I'd love to hear what sorts of solutions folks have in mind... as I think on it I keep coming back to a version of where we are today (namely: eventually one ends up in a place where nodes simply have to run in parallel to each other) but with more complexity.
Can we make the parameter a normal
I think individual nodes should be fine.
Agreed.
I would do the following:
The SIGINT handling covers running tests interactively. The user is in control and can decide. The SIGTERM handling covers shutdowns initiated by a CI. I believe Jenkins uses it. Not sure about Prow, but at least Kubernetes pods also get killed with SIGTERM.
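The signal wiring proposed here maps naturally onto the stdlib's signal.NotifyContext; a sketch (not Ginkgo API, and without the second-signal hard exit): the first SIGINT or SIGTERM cancels the returned context, so specs that thread it through their API calls unwind quickly.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// suiteContext returns a context that is cancelled on the first SIGINT or
// SIGTERM. A framework could hand this (or a per-node child of it) to
// each spec; a second signal forcing an immediate exit would be handled
// separately.
func suiteContext() (context.Context, context.CancelFunc) {
	return signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
}

func main() {
	ctx, stop := suiteContext()
	defer stop()
	fmt.Println("cancelled yet:", ctx.Err() != nil)
}
```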
Your proposal for

For

It feels, though, like we've lost sight of the original problems that opened this issue:
That isn't the behavior of

The other problem that was raised is:
but nothing we're discussing will prevent this in an iron-clad way. For example, the case where a timeout occurs and the node takes too long to finish so Ginkgo must move on (it can't wait forever in the case of a timeout!)

So... I guess... I'm not sure where we're landing. My sense is the existing best-effort behavior works reasonably well for most situations and that adding additional blocking/waiting via a
But I don't want to immediately start the next node, not while the current one is still doing some work (concurrency and potential data races). We could make this configurable, with the default as it is now. What is currently missing is an "abort right now". For that one has to have another shell ready and do a
I'm not sure anymore where the "aborts cleanup operations as soon as those block" came from - it probably was a misunderstanding because I wasn't aware that it just moves on while the cleanup is still on-going.
We have several individual timeouts for certain operations like "wait for pod to run", but we don't have any overall timeout for "this cleanup operation overall may take 5 minutes". The effect is that an AfterEach can be stuck for a long time while waiting for each of the individual operations to time out. Also, if any operation lacks a timeout, AfterEach can block forever. Having a context that gets passed around would be a more reliable and predictable way to enforce an overall upper limit for the duration.
With my proposal, the It nodes would be written so that they cannot take too long to finish. Once their context is cancelled, they will return pretty quickly because every operation in them immediately returns with a permanent failure. The AfterEach still can run longer, but that's a termination problem, not a concurrency problem.
I agree with the "rewriting all specs" - that's the bullet we have to bite. But once we have that, at least in Kubernetes the problem should be solved pretty well because most tests depend on a context. It's just that right now, they typically use context.Background instead of something that gets cancelled or times out.
But there would still need to be a backstop, correct? Let's focus on the timeout use-case (either suite level with

Is the idea that "that's ok if it happens rarely but it's super disruptive that it happens any time a user ^C or a suite timeout is hit"? I guess a part of me is trying to understand how big an issue this actually is before I invest effort in implementing the

(I appreciate it sounds like I'm trying to avoid doing the work - and, perhaps, at some level I'm wary of taking this on if the benefit isn't actually commensurate to the effort. I get that
It's a bit like killing a Pod: first there is SIGTERM, which can be caught and handled, then 30 seconds (configurable) later a SIGKILL, which cannot be caught. Here we don't have a SIGTERM, but continuing despite the running node comes close: it's the equivalent of "we cannot shut down cleanly". So yes, a 30 second grace period, configurable, and then continuing sounds fine to me.
The test harness in Kubernetes typically cannot clean up resources like volumes because it doesn't know about them. If those are in the cloud, they may cause costs until someone cleans up manually.
Here I disagree. When I abort with ^C, it's often because I have either passed a certain test or I got a test failure that I want to investigate. The suite then will get interrupted in the middle of some other test that isn't stuck. Same with a suite timeout: we run so many tests in Kubernetes with a fairly tight overall deadline, if some earlier tests ran more slowly, then later the suite timeout can hit a test that itself is fine and still making progress.
you're certainly correct about the ^C usecase not being about stuck tests - i was overreaching and that's fair pushback.

stepping back - i can implement this and i remain open to implementing it. it's going to add complexity but it does make sense and better aligns with Go's semantics around

What i'm trying to convey is that (a) this will not make a strong guarantee that cleanup code will never happen in parallel with

So, in effect, my concern is: i go to the effort of implementing this, and y'all go to the effort of threading
True. What I am after is clear documentation to avoid surprises and (for those who want it) a configuration option. Then if someone absolutely never wants to get parallel execution, they can make the backstop duration very long or perhaps (if the option allows it) disable it.
It's true that we don't have much actual experience with it. We have various complaints from developers who notice that cleanup was skipped (so it is important to support it!) but it is not something that occurs often enough that potential races have caused problems. However, as we are now going to design a better solution, I want to make sure that we also avoid such corner cases before rewriting our e2e code.
I think it will be noticeable, but I agree that it won't be that often.
closing this out for now, @pohly feel free to reopen but i think we've made a lot of progress! thanks!
In Kubernetes, we would like to have:

- an overall timeout (`-ginkgo.timeout`, with a very high value because the suite is large)
- a per-spec timeout (shorter than the `-ginkgo.timeout` duration on a single spec)

So far I have only found `-ginkgo.timeout`. One downside of it is that it also aborts cleanup operations as soon as those block, which is not useful for Kubernetes because the cleanup operation may involve communication with the apiserver to remove objects.

I tried with this:

When I run with `-ginkgo.timeout=10s -ginkgo.v -ginkgo.progress`, I get:

Note that the cleanup spec didn't run its defer.

We could provide per-spec timeouts via `ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)`. We then need to be careful that the cleanup spec doesn't use the same context because it wouldn't get any work done after a timeout. The downside of this is that we would have to touch a lot of code in Kubernetes, which is always a daunting prospect.

IMHO it would be simpler to have a default `-ginkgo.it-timeout` (for `It` specs), `-ginkgo.after-timeout` (for `AfterEach`) and perhaps a `Timeout(time.Duration)` decorator to override those defaults.
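A minimal sketch of the context.WithTimeout approach mentioned above, with cleanup deliberately given a fresh context so it still runs after the spec's deadline expires. All function names here are illustrative, not Ginkgo or Kubernetes API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// run gives the spec body a per-spec deadline, then runs cleanup against
// a separate, unexpired context. Reusing specCtx for cleanup would make
// every cleanup call fail immediately after a timeout.
func run() (specErr, cleanupErr error) {
	specCtx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	specErr = doWork(specCtx, 200*time.Millisecond) // exceeds the deadline

	// Cleanup gets its own context: it must still make progress.
	cleanupCtx, cancelCleanup := context.WithTimeout(context.Background(), time.Second)
	defer cancelCleanup()
	cleanupErr = doWork(cleanupCtx, 10*time.Millisecond)
	return
}

// doWork simulates an operation taking d, honoring context cancellation.
func doWork(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	specErr, cleanupErr := run()
	fmt.Println(specErr, cleanupErr)
}
```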