Travis - Fix integration tests random errors #1484

Closed
wants to merge 23 commits into from

Conversation

offerm
Contributor

@offerm offerm commented Jun 30, 2018

Prevents Travis random failures (race)

When running the Travis integration tests, there are sometimes random errors:

  1. The node fails to stop.
  2. The node can't execute an RPC call because the server is not running yet.

The reason for the second error is a race between LND and the test. LND opens the RPC port before the server is up. This may cause the node.start() call to complete before LND's server is running. In some cases, there are follow-up RPC calls (like listpeers) which may get an error if the server is not ready.

The reason for the first problem is a race between LND startup and the RPC stop call. Since the RPC port is open before the server startup is done, a test may send a stop request (via RPC) before the startup is complete. This may lead to a situation in which the LND process does not stop.

This fix handles the problem by waiting, within the start() code, until the server has started. This is done by calling DisconnectPeer and checking the result to make sure the server has started. DisconnectPeer returns an error ("chain backend is still syncing") until the server is marked as started.
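For reference, here is a minimal sketch of that polling idea, assuming a connected lnrpc.LightningClient; the helper name, package name, timeout, and bogus pubkey are illustrative and not the exact code in this PR:

```go
package itest

import (
	"context"
	"fmt"
	"strings"
	"time"

	"github.com/lightningnetwork/lnd/lnrpc"
)

// waitForServerStarted polls DisconnectPeer until the node stops returning
// the "chain backend is still syncing" error, i.e. until the server is
// marked as started. The pubkey is intentionally bogus; we only care about
// which error comes back.
func waitForServerStarted(ctx context.Context, client lnrpc.LightningClient) error {
	timeout := time.After(30 * time.Second)
	for {
		_, err := client.DisconnectPeer(ctx, &lnrpc.DisconnectPeerRequest{
			PubKey: "deadbeef",
		})
		if err == nil ||
			!strings.Contains(err.Error(), "chain backend is still syncing") {
			// Any other outcome means startup has completed.
			return nil
		}

		select {
		case <-timeout:
			return fmt.Errorf("node did not report started before timeout")
		case <-time.After(100 * time.Millisecond):
			// Keep polling until the server is marked started.
		}
	}
}
```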

As an implication of this fix, we can also remove the special handling in lnd.go for simnet (see diff). This makes the integration tests more realistic for testnet and mainnet.

Two comments here:

  1. The server is marked as started at the beginning of the Start() function. This is not safe; it should be marked at the end of the function (see the sketch below).

  2. It would be better to add a field to the getInfo response telling whether the server is synced.

I will add these enhancements once I get some positive feedback on this PR.
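To make comment (1) concrete, here is a rough sketch, assuming the server tracks startup with an atomic flag; the package, type, and field names are illustrative stand-ins rather than lnd's actual server.go code:

```go
package itest

import "sync/atomic"

// server is a stand-in for lnd's server type; only the started flag matters
// for this sketch.
type server struct {
	started int32
}

func (s *server) Start() error {
	// ... start subsystems here: chain backend, switch, RPC listeners ...

	// Flip the started flag only after everything above has come up,
	// rather than at the top of Start(), so a concurrent Stop or
	// DisconnectPeer cannot observe a half-started server.
	atomic.StoreInt32(&s.started, 1)
	return nil
}

// Started reports whether Start() has fully completed.
func (s *server) Started() bool {
	return atomic.LoadInt32(&s.started) == 1
}
```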

Roasbeef and others added 16 commits June 11, 2018 15:47
# Conflicts:
#	htlcswitch/switch.go
#	server.go
When running Travis integration tests, sometimes there are random errors:
1. The node fails to stop
2. The node can't execute an RPC call since the server is not running yet.

The reason for the second error is a race between LND and the test. LND opens the RPC port before the server is up. This may cause the node.start() call to complete before LND's server is running. In some cases there are follow-up RPC calls (like listpeers) which may get an error if the server is not ready.

The reason for the first problem is a race between LND startup and the RPC stop call. Since the RPC port is open before the server startup is done, a test may send a stop request (via RPC) before the startup is done. This may lead to a situation where the LND process will not stop.

This fix handles the problem by waiting, within the start() code, until the node is fully synced with the chain. This is done by calling getinfo and checking the results to make sure the node is synced to the chain.

As an implication of this fix we can also remove the special handling in lnd.go for using simnet (see diff). This makes the integration tests more realistic for testnet and mainnet.
Change method of validation from getinfo.SyncedToChain to a call to DisconnectPeer, which returns the error "chain backend is still syncing" until the server has started.

Two comments here:

1. The server is marked as started at the beginning of the Start() function. This is not safe and should be done at the end of the function.

2. It would be better to add a field to the getInfo response telling whether the server is synced.

I will add these enhancements once I get some positive feedback for this PR.
This reverts commit 0686de0.
Make sure server started in all NewNode execution flows.
No need for the check at shutdown.
@Roasbeef
Member

Thanks for the PR! Can you clean up the commit history a bit, so that it's just the commits that attempt to address the flakes in the integration tests?

@Roasbeef
Member

Yeah, we've been seeing this pop up recently, wherein the set of tests hits the 10-minute goroutine livelock timer as a result of the goroutines not being able to exit properly. If we can resolve this, then I think we'll soon be able to restore Travis to its former green glory!

@offerm
Contributor Author

offerm commented Jun 30, 2018

Will take care of it once I finish the work. There are still failures to resolve.

Can you share the configuration used to run the integration tests? Is it a single-CPU/core machine? How much memory is available to it?

These integration tests run without any issue on my development machine. It would be great to understand why there are all these problems under Travis.

Offer

@Roasbeef
Member

Can you share the configuration used to run the integration tests? Is it a single-CPU/core machine? How much memory is available to it?

A Nokia phone? No idea tbh; we see this behavior consistently where things work for all of us locally, but then on Travis we run into weird failures. On the upside, the constrained environment the tests execute within has helped us catch some obscure bugs, but it makes things very hard to reproduce. Lately, we've enabled full extraction of logs, which has helped us reduce the number of flakes in execution.

@offerm offerm closed this Jul 1, 2018