Travis - Fix integration tests random errors #1484

Closed
wants to merge 23 commits into from

Conversation

offerm
Contributor

@offerm offerm commented Jun 30, 2018

Prevents Travis random failures (race)

When running the Travis integration tests, there are sometimes random errors:

  1. The node fails to stop.
  2. The node can't execute an RPC call because the server is not running yet.

The reason for the second error is a race between LND and the test. LND opens the RPC port before the server is up. This may cause the node.start() call to complete before LND's server is running. In some cases, there are follow-up RPC calls (like listpeers) which may get an error if the server is not ready.

The reason for the first problem is a race between LND startup and the RPC stop call. Since the RPC port is open before the server startup is done, a test may send a stop request (via RPC) before the startup is complete. This may lead to a situation in which the LND process does not stop.

This fix handles the problem by waiting, within the start() code, until the server has started. This is done by calling DisconnectPeer and checking the result to make sure the server has started. DisconnectPeer returns an error ("chain backend is still syncing") until the server is marked as started.
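For reference, here is a minimal sketch of that polling idea, assuming a connected lnrpc.LightningClient; the helper name, package name, timeout, and bogus pubkey are illustrative and not the exact code in this PR:

```go
package itest

import (
	"context"
	"fmt"
	"strings"
	"time"

	"github.com/lightningnetwork/lnd/lnrpc"
)

// waitForServerStarted polls DisconnectPeer until the node stops returning
// the "chain backend is still syncing" error, i.e. until the server is
// marked as started. The pubkey is intentionally bogus; we only care about
// which error comes back.
func waitForServerStarted(ctx context.Context, client lnrpc.LightningClient) error {
	timeout := time.After(30 * time.Second)
	for {
		_, err := client.DisconnectPeer(ctx, &lnrpc.DisconnectPeerRequest{
			PubKey: "deadbeef",
		})
		if err == nil ||
			!strings.Contains(err.Error(), "chain backend is still syncing") {
			// Any other outcome means startup has completed.
			return nil
		}

		select {
		case <-timeout:
			return fmt.Errorf("node did not report started before timeout")
		case <-time.After(100 * time.Millisecond):
			// Keep polling until the server is marked started.
		}
	}
}
```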

As an implication of this fix, we can also remove the special handling in lnd.go for simnet (see diff). This makes the integration tests more realistic for testnet and mainnet.

Two comments here:

  1. The server is marked as started at the beginning of the Start() function. This is not safe; it should be marked at the end of the function (see the sketch below).

  2. It would be better to add a field to the getInfo response telling whether the server is synced.

I will add these enhancements once I get some positive feedback on this PR.
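To make comment (1) concrete, here is a rough sketch, assuming the server tracks startup with an atomic flag; the package, type, and field names are illustrative stand-ins rather than lnd's actual server.go code:

```go
package itest

import "sync/atomic"

// server is a stand-in for lnd's server type; only the started flag matters
// for this sketch.
type server struct {
	started int32
}

func (s *server) Start() error {
	// ... start subsystems here: chain backend, switch, RPC listeners ...

	// Flip the started flag only after everything above has come up,
	// rather than at the top of Start(), so a concurrent Stop or
	// DisconnectPeer cannot observe a half-started server.
	atomic.StoreInt32(&s.started, 1)
	return nil
}

// Started reports whether Start() has fully completed.
func (s *server) Started() bool {
	return atomic.LoadInt32(&s.started) == 1
}
```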

Roasbeef and others added 16 commits June 11, 2018 15:47
# Conflicts:
#	htlcswitch/switch.go
#	server.go
When running Travis integration tests, sometimes there are random errors:
1. The node fails to stop
2. The node can't execute an RPC call since the server is not running yet.

The reason for the second error is a race between LND and the test. LND opens the RPC port before the server is up. This may cause the node.start() call to complete before LND's server is running. In some cases there are follow-up RPC calls (like listpeers) which may get an error if the server is not ready.

The reason for the first problem is a race between LND startup and the RPC stop call. Since the RPC port is open before the server startup is done, a test may send a stop request (via RPC) before the startup is done. This may lead to a situation where the LND process will not stop.

This fix handles the problem by waiting, within the start() code, until the node is fully synced with the chain. This is done by calling getinfo and checking the results to make sure the node is synced to the chain.

As an implication of this fix we can also remove the special handling in lnd.go for using simnet (see diff). This makes the integration tests more realistic for testnet and mainnet.
Change method of validation from getinfo.SyncedToChain to a call to DisconnectPeer, which returns the error "chain backend is still syncing" until the server has started.

Two comments here:

1. The server is marked as started at the beginning of the Start() function. This is not safe and should be done at the end of the function.

2. It would be better to add a field to the getInfo response telling whether the server is synced.

I will add these enhancements once I get some positive feedback for this PR.
This reverts commit 0686de0.
Make sure server started in all NewNode execution flows.
No need for the check at shutdown.
@Roasbeef
Member

Thanks for the PR! Can you clean up the commit history a bit, so that it's just the commits that attempt to address the flakes in the integration tests?

@Roasbeef
Member

Yeah, we've been seeing this pop up recently, wherein the set of tests hits the 10-minute goroutine livelock timer as a result of the goroutines not being able to exit properly. If we can resolve this, then I think we'll soon be able to restore Travis to its former green glory!

@offerm
Contributor Author

offerm commented Jun 30, 2018

Will take care of it once I finish the work. There are still failures to resolve.

Can you share the configuration used to run the integration tests? Is it a single-CPU/core machine? How much memory is available to it?

These integration tests run without any issue on my development machine. It would be great to understand why there are all these problems under Travis.

Offer

@Roasbeef
Member

Can you share the configuration used to run the integration tests? Is it a single-CPU/core machine? How much memory is available to it?

A Nokia phone? No idea tbh; we see this behavior consistently where things work for all of us locally, but then on Travis we run into weird failures. On the upside, the constrained environment the tests execute within has helped us catch some obscure bugs, but it makes things very hard to reproduce. Lately, we've enabled full extraction of logs, which has helped us reduce the number of flakes in execution.

@offerm offerm closed this Jul 1, 2018