-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Travis - Fix integration tests random errors #1484
Conversation
This reverts commit 09253ea.
…lly later"" This reverts commit 077b1ff.
This reverts commit a779bef.
This reverts commit e5233c8.
# Conflicts: # htlcswitch/switch.go # server.go
# Conflicts: # htlcswitch/switch.go
When running Travis integration tests, sometimes there are random errors: 1. The node fail to stop 2. The node can't execute RPC call since the server is not running yet. The reason for the second error is a race between LND and the test. LND opens the RPC port before the server is up. This may cause the node.start() call to complete before the LND's server is running. In some cases there are followup rpc calls (like listpeers) which may get an error if the server is not ready. The reason for the first problem is a race between LND startup and rpc stop call. Since the RPC port is open before the server startup is done a test may send a stop request (via RPC) before the startup is done. This may lead a situation that the LND process will not stop. This fix handles the problem by waiting, within the start() code until the node is fully synced with the chain. This is done by calling getinfo and checking the results to make sure the node is synced to chain. As an implication of this fix we can also remove the special handling in lnd.go for using simnet (see diff). This makes the integration tests more realistic for testnet and mainnet.
Change method of validation from getinfo.SyncedToChain to a call to DisconnectPeer which returns an error "chain backend is still syncing" until the server started. Two comments here: 1. Server is marked started at the begging of the Start() function. This is not safe and should be marked at the end of the function. 2. Better to add a field to getInfo response telling of the server is Synched. I will this enhancment once I get some positive feedback for this PR.
This reverts commit 0686de0.
Thanks for the PR! Can you clean up commit history a bit? Such that it's just the sole commits that attempt to address the flakes in the integration tests? |
Yeah we've been seeing this pop up recently, wherein the set of tests hits the 10 minute goroutine livelock timer as a result of the goroutines not being able to exit properly. If we can resolve this, then I think we'll be able to soon restore travis to it's former green glory! |
Will take care of it once I finish the work. There are still failures to resolve. Can you share the configuration used to run the integration tests? Is it a single CPU/core machine? how much memory is available for it? These integration tests run without any issue on my development machine. It will be great to understand why there are all these problems under Travis. Offer |
A Nokia phone? No idea tbh, we see this behavior consistently where things work for us all locally, but then on travis we run into weird failures. On the upside, the constrained environment the tests execute within have helped us to catch so obscure bugs, but it makes things very hard to reproduce. Lately, we've enabled full extraction of logs which has helped us reduce the number of flakes in execution. |
Prevents travis random failures (race)
When running Travis integration tests, sometimes there are random errors:
The reason for the second error is a race between LND and the test. LND opens the RPC port before the server is up. This may cause the node.start() call to complete before the LND's server is running. In some cases, there are followup RPC calls (like listpeers) which may get an error if the server is not ready.
The reason for the first problem is a race between LND startup and RPC stop call. Since the RPC port is open before the server startup is done a test may send a stop request (via RPC) before the startup is done. This may lead to a situation that the LND process will not stop.
This fix handles the problem by waiting, within the start() code until the server started. This is done by calling DisconnectPeer and checking the results to make sure the server started. DisconnectPeer returns an error ("chain backend is still syncing") until the server is marked started
As an implication of this fix, we can also remove the special handling in lnd.go for using simnet (see diff). This makes the integration tests more realistic for testnet and mainnet.
Two comments here:
The server is marked started at the begging of the Start() function. This is not safe and should be marked at the end of the function.
Better to add a field to getInfo response telling of the server is Synched.
I will this enhancement once I get some positive feedback for this PR.