
increase mint auditor test timeout #2492

Merged
merged 2 commits into master from try-to-make-mint-auditor-tests-less-flakey on Sep 9, 2022

Conversation

@cbeck88 (Contributor) commented Sep 8, 2022

This increases the amount of time the test waits for mobilecoind and the mint auditor to catch up to network activity, and adds a named constant so that this can easily be adjusted in the future.

@cbeck88 (Contributor, Author) commented Sep 8, 2022

In some discussions, people expressed unhappiness with the idea that the solution to flaky tests is to adjust sleep statements.

However, this is a concrete example where I cannot see any possible alternative fix.

Fundamentally, when you are doing an integration test with a distributed system, the test is a separate process from the code under test. There is nothing the test process can do to ensure that the other processes make progress -- indeed, they may not make progress; there may be a bug like a deadlock or something.

We could say that the test should loop infinitely until the other processes make progress. But this is also bad, because tests have to time out eventually. We can't just let the test run forever; at some point we have to accept that it failed.

So I think increasing the sleep statement is the correct fix, and I don't see any other possible fix that makes sense.

I think it may be that if you want to do integration tests of a distributed system, you will have to use some sleep statements, and you will have to maintain those sleep statements as part of maintaining the tests. There's just no way around it, AFAICT.

One of the main complaints was that, if we ever make changes to the CI runners, so that they are faster or slower or whatever, then all these sleep statements may have to be adjusted, which is a lot of work.

There are other things we could do, I think:

  • Try to make these timeout values controlled by a global constant, to make them easier to change.
  • Or even better, maybe an environment variable, so that they can all be changed at once easily.
  • We could decide that the test code should have an infinite loop, and a supervisory process (e.g. the GHA runner) is responsible for enforcing timeouts. But I don't really like that because it will harm the user experience of locally running tests.

Interested in what other people think about this.

@nick-mobilecoin (Collaborator) commented:
> There is nothing the test process can do to ensure that the other processes make progress

Is there no heartbeat from these processes? When we host them, how do we know that they're doing something and not hung?
If there is no heartbeat, what would it take to add one?

> We could decide that the test code should have an infinite loop, and a supervisory process (e.g. the GHA runner) is responsible for enforcing timeouts. But I don't really like that because it will harm the user experience of locally running tests.

Do these tests get run individually (`some_test.py`) or as a group (`some_batch_test_runner.py`)? I think if a user invoked a test directly, it's not unreasonable for them to kill the test if it hangs on them. If it's launched as a group, then the launcher should have logic to kill the subprocesses after a timeout. `cargo test -- --help` has:

        --ensure-time   Treat excess of the test execution time limit as
                        error.
                        Threshold values for this option can be configured via
                        `RUST_TEST_TIME_UNIT`, `RUST_TEST_TIME_INTEGRATION`
                        and
                        `RUST_TEST_TIME_DOCTEST` environment variables.
                        Expected format of environment variable is
                        `VARIABLE=WARN_TIME,CRITICAL_TIME`.
                        `CRITICAL_TIME` here means the limit that should not
                        be exceeded by test.

@cbeck88 (Contributor, Author) commented Sep 8, 2022

> Is there no heartbeat from these processes? When we host them, how do we know that they're doing something and not hung?
> If there is no heartbeat, what would it take to add one?

The processes have a minimal concept of "readiness" which is used by Kubernetes, but this is basically just a flag that says "the grpc server was started and declared itself ready to accept requests". It doesn't communicate whether the server is stuck for some reason.

To actually determine whether processes are making progress, we have Prometheus metrics which get plotted in Grafana, and a human can look at them and decide whether a server is stuck or falling behind.

Ultimately even that isn't reliable enough for alerting in production. The way we actually ensure that the system is working in prod is by exercising it using the test client. We send a transaction, then check balances. We do this once a minute or so. If the test client observes something wonky, or it takes more than a minute for the test client to make progress, then we fire an alert.

So in the end, it's the same as the way this test is working:

  1. Exercise the system
  2. Wait a while
  3. If nothing happened, then the test failed / fire an alert

I think this is actually how it works for testing all real distributed systems, even scalar clock systems. You may be able to write more unit tests and fewer integration tests for systems like this, and rely less on sleeps and timeouts, but at the end of the day, when you want to test the system as a whole, it boils down to what we are doing here.
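
As a rough sketch of that loop (hypothetical `client` and `alerter` names, not our actual test client code):

    import time

    # Hypothetical sketch of the pattern above: exercise the system, wait a
    # bounded amount of time for an observable effect, and alert otherwise.
    def canary_round(client, alerter, timeout_seconds=60):
        balance_before = client.get_balance()
        client.send_test_transaction()
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            if client.get_balance() != balance_before:
                return  # progress observed; this round passed
            time.sleep(1)
        alerter.fire("test client made no progress within %d seconds" % timeout_seconds)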

@cbeck88 (Contributor, Author) commented Sep 8, 2022

> I think if a user invoked a test directly, it's not unreasonable for them to kill the test if it hangs on them.

Sure, but what are we accomplishing by restructuring the system that way?

We can say, we'll push the burden of setting the timeouts onto the user. But ultimately we are the user, so we are still in a position of maintaining all the timeouts.

I'm generally in favor of making the tests as easy to run as possible; ideally it's just a push-button `cargo test`. What do we gain by forcing the user to supply a timeout manually instead of having it be part of the tests? Doesn't the user just always want to use the "correct" value? And don't we have to know what the correct value is anyway? So why make it harder?

@cbeck88 (Contributor, Author) commented Sep 8, 2022

> If there is no heartbeat, what would it take to add one?

This is what the test code looks like right now:

    def wait_for_mint_auditor_to_sync(self):
        """Wait for the mint auditor to sync to the network block height.
        Return the last block audit data"""
        network_block_index = self.mobilecoind_client.get_network_block_index()
        for _ in range(TIMEOUT_SECONDS):
            response = self.mint_auditor_client.get_last_block_audit_data()
            if response.block_audit_data.block_index == network_block_index:
                break
            time.sleep(1)  # poll roughly once per second until the deadline
        return response

So, `get_last_block_audit_data().block_audit_data.block_index` is literally the heartbeat. That's how we know whether the service is making progress, and we are using it. But this just isn't related to the need for a timeout -- even with these heartbeats, we still need a timeout, and therefore we still have a timeout to maintain.

@cbeck88 (Contributor, Author) commented Sep 8, 2022

I mean I would be okay with something like:

  1. By convention, the env var `MC_TEST_TIMEOUT` sets how many seconds we wait in these loops, and all these magic numbers are replaced with references to it, whether in Python code, Rust code, or whatever (see the sketch after this list)
  2. I would add something like `MC_TEST_TIMEOUT=60` to the `./mob prompt` tool when it launches the prompt container
  3. We could similarly add an env setting like this in the GHA CI runners
  4. In the future when we get environment-related timeouts, we can just twiddle that number
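
A minimal sketch of what (1) could look like on the Python side (the 60-second default here is just an assumption):

    import os

    # Shared convention (sketch): every polling loop reads its budget from
    # MC_TEST_TIMEOUT, with a fallback default for local runs.
    TIMEOUT_SECONDS = int(os.environ.get("MC_TEST_TIMEOUT", "60"))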

The main drawback I can see is that this number is likely to grow over time and nothing will ever make it smaller, so it may make the tests take longer. But maybe that's fine.

@cbeck88 (Contributor, Author) commented Sep 8, 2022

However, I'm still kind of skeptical of that kind of approach. For instance, here's another test:

https://github.com/mobilecoinfoundation/mobilecoin/blob/master/go-grpc-gateway/test.sh


    # Spawn rust stub server
    "$CARGO_TARGET_DIR/debug/stub" --chain-id "local" --client-listen-uri insecure-fog://localhost:3000 &
    pid=$!

    sleep 1

    # Spawn grpc proxy
    ./grpc-proxy -grpc-insecure -grpc-server-endpoint localhost:3000 -http-server-listen :8080 -logtostderr &
    pid2=$!

    sleep 5

    result=$(curl -XPOST -H "Content-Type: application/json" http://localhost:8080/report.ReportAPI/GetReports -d "{}")
    expected="{\"reports\":[],\"chain\":[],\"signature\":\"AAEAAQ==\"}"

This server is pretty simple and only needs a second or two to start up, at most. But if we don't sleep at all, then there is a race where we get to the `curl` part before it has started at all.

I don't think replacing the 5 with `MC_TEST_TIMEOUT` is a good idea, because that would make the test take at least 2 minutes to run when it should take a few seconds.

So I'm kind of skeptical of a one-size-fits-all approach.

Maybe this is just tech debt, and there should be some kind of loop here which probes the servers after they are spawned. Not sure.
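
For instance, the probe could look something like this (a sketch in Python for illustration, even though test.sh itself is shell; the URL is the proxy endpoint from the script above):

    import time
    import urllib.error
    import urllib.request

    # Hypothetical probe: instead of a fixed `sleep 5`, poll the proxy's HTTP
    # port until it responds at all, up to a short deadline.
    def wait_until_listening(url, budget_seconds=5.0):
        deadline = time.monotonic() + budget_seconds
        while time.monotonic() < deadline:
            try:
                urllib.request.urlopen(url, timeout=1)
                return
            except urllib.error.HTTPError:
                return  # any HTTP response at all means the server is up
            except urllib.error.URLError:
                time.sleep(0.1)  # not accepting connections yet; retry
        raise TimeoutError("%s did not come up within %s seconds" % (url, budget_seconds))

    # e.g. wait_until_listening("http://localhost:8080/")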

If it isn't broken then it doesn't seem very interesting to spend time fixing it.

Maybe we could adopt a new convention though, and whenever things are broken we migrate them to the convention in order to fix them?

@eranrund (Contributor) commented Sep 8, 2022

> Maybe this is just tech debt, and there should be some kind of loop here which probes the servers after they are spawned. Not sure.

Even if there's a loop, it will likely not be infinite...

I don't think a one-size-fits-all timeout makes sense, as you just demonstrated above. I also don't think a world where no sleeps are ever needed is realistic. It is technically possible to get there -- for example, just to make a point, you could have processes report their state to a named pipe created before they are started, and abort if they don't report what you expect within a certain amount of time -- but it is not worth the extra complication that maintaining that side channel would entail. Timeouts are a well-established concept, and in some cases it is easiest to implement them with a polling loop and some sleeps. Since in some tests stuff happens concurrently and takes a nondeterministic amount of time, some form of waiting for what you expect to happen seems unavoidable. So whether it's via a dumb sleep or a more clever mechanism, you are still putting time constraints on the test.

I think an improvement we could make is to move from arbitrarily long sleeps to polling loops, so that we don't slow things down unnecessarily (e.g. retry every 100ms for up to however long we've observed a thing could take), or to slightly smarter polling loops (e.g. keep retrying for up to some amount of time, and reset your counter every time you see progress -- which is a thing I don't think we do in most places we sleep or poll).
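
A sketch of that last idea (hypothetical names; `get_height` stands in for whatever progress indicator the test polls):

    import time

    # Sketch: poll every 100ms, but extend the deadline whenever forward
    # progress is observed, so a slow-but-alive service doesn't fail the test.
    def poll_with_progress_reset(get_height, target_height, budget_seconds=30):
        deadline = time.monotonic() + budget_seconds
        last_seen = get_height()
        while time.monotonic() < deadline:
            height = get_height()
            if height >= target_height:
                return height
            if height > last_seen:
                last_seen = height
                deadline = time.monotonic() + budget_seconds  # progress: reset the budget
            time.sleep(0.1)
        raise TimeoutError("stuck at %s, wanted %s" % (last_seen, target_height))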

@jcape (Contributor) commented Sep 8, 2022

I'll note that there is a practical difference between:

  1. Do something
  2. Poll for success until TIMEOUT expires
  3. Do the next thing

and:

  1. Do something
  2. Sleep for what I think TIMEOUT should be
  3. Do the next thing

In the former, there is no balancing act when selecting TIMEOUT. It's perfectly reasonable to "poll forever", where "forever" means "until the GHA scheduler comes along and kills it because it hasn't made forward progress". It's equally reasonable to set the TIMEOUT to 2 minutes.

@nick-mobilecoin (Collaborator) left a comment:
LGTM, based on most of the discussion supporting the polling approach, and on this PR updating a polling timeout.

@samdealy (Contributor) commented Sep 8, 2022

> I think an improvement we could make is to move from arbitrarily long sleeps to polling loops, so that we don't slow things down unnecessarily (e.g. retry every 100ms for up to however long we've observed a thing could take), or to slightly smarter polling loops (e.g. keep retrying for up to some amount of time, and reset your counter every time you see progress -- which is a thing I don't think we do in most places we sleep or poll).

Agreed that we should prefer polling to sleep statements. This article sums up the pros of polling or "busy-waiting" pretty well.

@cbeck88 merged commit 99883ec into master on Sep 9, 2022
@cbeck88 deleted the try-to-make-mint-auditor-tests-less-flakey branch on September 9, 2022