increase mint auditor test timeout #2492
Conversation
This increases the amount of time the test waits for mobilecoind and the mint auditor to catch up to network activity, and adds a named constant so that the timeout can easily be adjusted in the future.
In some discussions, people expressed unhappiness with the idea that the solution to flaky tests is to adjust sleep statements. However, this is a concrete example where I cannot see any possible alternative fix.

Fundamentally, when you are doing an integration test of a distributed system, the test is a separate process from the code under test. There is nothing the test process can do to ensure that the other processes make progress -- indeed, they may not make progress at all; there may be a bug like a deadlock. We could say that the test should loop indefinitely until the other processes make progress, but that is also bad, because tests have to time out eventually. We can't just let the test run forever; at some point we have to accept that it failed.

So I think increasing the sleep is the correct fix, and I don't see any other possible fix that makes sense. One of the main complaints was that, if we ever change the CI runners so that they are faster or slower, then all these sleep statements may have to be adjusted, which is a lot of work. There are other things we could do, I think:
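To make the tradeoff concrete, here is a minimal sketch of the pattern this PR moves toward: poll for catch-up, but bound the wait with a named constant so the test eventually fails rather than hanging forever. The names (`CATCHUP_TIMEOUT`, `wait_for_catchup`, `current_height`) are hypothetical, not the actual identifiers in the test.

```rust
use std::time::{Duration, Instant};

// Hypothetical named constants; the point is that they live in one place
// and can be bumped if the CI runners get slower.
const CATCHUP_TIMEOUT: Duration = Duration::from_secs(120);
const POLL_INTERVAL: Duration = Duration::from_millis(200);

/// Poll `current_height` until it reaches `target`, failing the test
/// (with an error) if CATCHUP_TIMEOUT elapses first.
fn wait_for_catchup(target: u64, mut current_height: impl FnMut() -> u64) -> Result<(), String> {
    let deadline = Instant::now() + CATCHUP_TIMEOUT;
    loop {
        let seen = current_height();
        if seen >= target {
            return Ok(());
        }
        if Instant::now() >= deadline {
            // The process under test may be deadlocked; give up.
            return Err(format!("timed out catching up: at block {} of {}", seen, target));
        }
        std::thread::sleep(POLL_INTERVAL);
    }
}
```

On a fast runner this returns as soon as the target is reached; the constant only bounds the worst case.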
Interested in what other people think about this.
Is there no heartbeat from these processes? When we host them, how do we know that they're doing something and not hung?
Do these tests get run individually with `--ensure-time`?

```
--ensure-time    Treat excess of the test execution time limit as
                 error.
                 Threshold values for this option can be configured via
                 `RUST_TEST_TIME_UNIT`, `RUST_TEST_TIME_INTEGRATION` and
                 `RUST_TEST_TIME_DOCTEST` environment variables.
                 Expected format of environment variable is
                 `VARIABLE=WARN_TIME,CRITICAL_TIME`.
                 `CRITICAL_TIME` here means the limit that should not
                 be exceeded by test.
```
The processes have a minimal concept of "readiness" which is used by Kubernetes, but this is basically just a flag that says "the gRPC server was started and declared itself ready to accept requests". It doesn't tell you whether the server is stuck for some reason.

To actually determine whether processes are making progress, we have Prometheus metrics which get plotted in Grafana, and a human can look at them and decide if a server is stuck or falling behind. Ultimately, even that isn't reliable enough for alerting in production. The way we actually ensure that the system is working in prod is by exercising it with the test client: we send a transaction, then check balances, roughly once a minute. If the test client observes something wonky, or it takes more than a minute for it to make progress, then we fire an alert. So in the end, it's the same as the way this test works.
I think this is actually how testing works for all real distributed systems, even scalar clock systems. You may be able to write more unit tests and fewer integration tests for systems like this, and rely less on sleeps and timeouts, but at the end of the day, when you want to test the system as a whole, it boils down to what we are doing here.
Sure, but what are we accomplishing by restructuring the system that way? We can say we'll push the burden of setting the timeouts onto the user, but ultimately we are the user, so we are still in the position of maintaining all the timeouts. I'm generally in favor of making the tests as easy to run as possible; ideally it's just push-button.
This is what the test code looks like right now:
So, …
I mean, I would be okay with something like:

The main drawback I can see is that this number is likely to grow over time and nothing will ever make it smaller, so it may make tests take longer. But maybe that's fine.
However, I'm still kind of skeptical of that kind of an approach, for instance, here's another test: https://github.com/mobilecoinfoundation/mobilecoin/blob/master/go-grpc-gateway/test.sh
This server only needs a second or two to start up at most; it's pretty simple. But if we don't sleep at all, then there is a race where we get to the … before the server is up. I don't think replacing … So I'm kind of skeptical of a one-size-fits-all approach. Maybe this is just tech debt, and there should be some kind of loop here which probes the servers after they are spawned. Not sure. If it isn't broken, then it doesn't seem very interesting to spend time fixing it. Maybe we could adopt a new convention, though, and whenever things are broken we migrate them to the convention in order to fix them?
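One way such a probe loop could look, assuming all we want to know is "is the server accepting connections yet": retry a TCP connect until it succeeds or a deadline passes. This is a sketch, not the actual test.sh logic; `wait_for_listening` and its parameters are made up for illustration.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Probe `addr` until a TCP connection succeeds or `timeout` elapses.
/// Replaces a bare startup `sleep` with "wait until actually listening".
fn wait_for_listening(addr: SocketAddr, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        // A successful connect means the server's listen socket is open.
        if TcpStream::connect_timeout(&addr, Duration::from_millis(250)).is_ok() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}
```

Note this only closes the "not yet listening" race; like the readiness flag discussed above, it says nothing about whether the server is making progress afterward.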
Even if there's a loop, it will likely not be infinite. I don't think a one-size-timeout-fits-all makes sense, as you just demonstrated above. I also don't think a world where no sleeps are ever needed is reasonable. While it is technically possible to get there (for example, just to make a point: you could have processes report their state to a named pipe created before they are started, and abort if they don't report what you expect within a certain amount of time), it is not worth the extra complication of maintaining that side channel.

I think timeouts are a well-established concept, and in some cases it is easiest to implement them with a polling loop and some sleeps. Since in some tests things happen concurrently and take a nondeterministic amount of time, some form of waiting for what you expect to happen seems unavoidable. So whether it's via a dumb sleep or a more clever mechanism, you are still putting time constraints on the test.
I'll note that there is a practical difference between:
and:
In the former, there is no balancing act when selecting TIMEOUT. It's perfectly reasonable to "poll forever", where "forever" is "until the GHA scheduler comes along and kills it because it hasn't made forward progress". It's perfectly reasonable to set the TIMEOUT to 2 minutes.
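The two patterns being contrasted (the original snippets are elided above) could be sketched like this; the function names are hypothetical:

```rust
use std::time::{Duration, Instant};

// Former pattern: poll until `ready()` holds or `timeout` elapses.
// Returns as soon as the condition is true, so the timeout can be set
// generously (say, 2 minutes) without slowing the happy path.
fn wait_polling(mut ready: impl FnMut() -> bool, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        std::thread::sleep(Duration::from_millis(100));
    }
}

// Latter pattern: a fixed sleep always pays the full wait, and still
// races if the system happens to be slower than the chosen duration.
fn wait_fixed_sleep(sleep_time: Duration) {
    std::thread::sleep(sleep_time);
}
```

With polling, a generous timeout costs nothing when the system is fast; with a fixed sleep, every run pays the worst case and slow runs can still lose the race.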
LGTM, based on most in the discussion supporting the polling approach, and this PR is updating a polling timeout.
Agreed that we should prefer polling to sleep statements. This article sums up the pros of polling, or "busy-waiting", pretty well.