
Travis - Fix integration tests random errors #1493

Closed
wants to merge 8 commits into from

Conversation

offerm
Contributor

@offerm offerm commented Jul 1, 2018

Prevents Travis random failures (race)
When running Travis integration tests, sometimes there are random errors:

  1. Nodes fail to stop.
  2. A node can't execute an RPC call since the server is not running yet.
  3. Issue #1496 (Missing channel announcement).

The reason for the second error is a race between LND and the test. LND opens the RPC port before the server is up. This may cause the node.start() call to complete before LND's server is running. In some cases, there are follow-up RPC calls (like listpeers) which may get an error if the server is not ready.

The reason for the first problem is a race between LND startup and the RPC stop call. Since the RPC port is open before server startup is done, a test may send a stop request (via RPC) before startup completes. This may lead to a situation where the LND process will not stop.

This fix handles the problems by the following changes:

  1. Ensure the completion of server startup. This is done by calling DisconnectPeer with an invalid pubKey and checking the result. DisconnectPeer returns an error ("chain backend is still syncing") until the server is marked started. A sketch of this polling loop follows below.

  2. Remove the t.parallel() from TestChannelLinkBidirectionalOneHopePayments. This test is CPU intensive and creates ~966 goroutines. While the test itself is OK, it impacts other tests that are running in parallel. In these other tests we can find sleep(time) calls that are used for sync. Under this load the sleep(time) may be too short, which can cause these tests to fail.

  3. Properly set timeouts (contexts) in tests.

  4. Temporary fix for issue #1496.

  5. Indicate "server started" at the end of the Start() function rather than at its beginning.

As an implication of this fix, we can also remove the special handling in lnd.go for using simnet (see diff). This makes the integration tests more realistic for testnet and mainnet.

Issue #1496 is important and may cause errors in other places too.
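For concreteness, here is a minimal sketch of the polling loop from item 1, in the style of the lntest harness (HarnessNode embeds lnrpc.LightningClient); the retry count and interval here are illustrative assumptions, not the PR's exact values:

    import (
        "context"
        "fmt"
        "strings"
        "time"

        "github.com/lightningnetwork/lnd/lnrpc"
    )

    // ensureServerStarted polls DisconnectPeer with a bogus pubkey until the
    // "still syncing" error disappears, which signals that the server has
    // marked itself started.
    func ensureServerStarted(node *HarnessNode) error {
        req := &lnrpc.DisconnectPeerRequest{PubKey: "invalid-pubkey"}

        for i := 0; i < 50; i++ {
            ctxt, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            _, err := node.DisconnectPeer(ctxt, req)
            cancel()

            // Once started, DisconnectPeer rejects the bogus pubkey with a
            // different error; only "still syncing" means startup is pending.
            if err == nil || !strings.Contains(err.Error(), "still syncing") {
                return nil
            }
            time.Sleep(200 * time.Millisecond)
        }
        return fmt.Errorf("server did not report started in time")
    }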

@offerm
Contributor Author

offerm commented Jul 1, 2018

@Roasbeef If this PR passes all checks, it is ready to be merged.

Let me know if you have any comments.

Offer

@offerm
Contributor Author

offerm commented Jul 3, 2018

@Roasbeef This is ready as far as I'm concerned.

Offer

@Roasbeef Roasbeef requested a review from wpaulino July 4, 2018 03:51
offerm and others added 6 commits July 5, 2018 01:04
Prevents Travis random failures (race)
Overcome a problem with reflect.DeepEqual, which treats an empty slice and an unallocated (nil) slice as different; this causes a unit test to fail from time to time when the number of ShortChanIDs is 0 (selected randomly between 0 and 5000)
Implement error checking in Makefile
Prevent errors (failures) when running without tags="debug"
Overcome a problem with reflect.DeepEqual, which treats an empty slice and an unallocated (nil) slice as different; this causes a unit test to fail from time to time when the number of ShortChanIDs is 0 (selected randomly between 0 and 5000)

(cherry picked from commit eca7586)
TestLogTicker verifies that the logTicker ticked by waiting 11 seconds after the start of the switch; without this, the code handling the logTicker is never called and there is a drop in coverage stats (build failure)
Verify that the logTicker and fwdEventTicker ticked by waiting 16 seconds after the start of the switch; without this, the code handling those tickers is never called and there is a drop in coverage stats (build failure)
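A sketch of the shape of these ticker tests (newTestSwitch is a hypothetical helper, and the interval is inferred from the 11-second wait):

    // TestLogTicker blocks for longer than the logTicker interval so the
    // ticker branch of the switch's event loop actually runs and is counted
    // by the coverage tooling.
    func TestLogTicker(t *testing.T) {
        s := newTestSwitch(t) // hypothetical: builds and starts a Switch
        defer s.Stop()

        // The logTicker fires roughly every 10 seconds; sleeping 11 seconds
        // guarantees at least one tick is processed before the test ends.
        time.Sleep(11 * time.Second)
    }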
@@ -1686,6 +1689,9 @@ func updateState(batchTick chan time.Time, link *channelLink,
// sleep in this test and the one below
func TestChannelLinkBandwidthConsistency(t *testing.T) {
t.Parallel()
if !hodl.DebugBuild {
Contributor

👍

lnd.go Outdated
@@ -568,45 +568,43 @@ func lndMain() error {
}()
}

// If we're not in simnet mode, We'll wait until we're fully synced to
// We'll wait until we're fully synced to
Contributor

nit: expand comment to 80 chars per line.

Contributor Author

Done

lnd_test.go Outdated
@@ -8759,7 +8774,7 @@ func testSwitchOfflineDelivery(net *lntest.NetworkHarness, t *harnessTest) {
Index: chanPoint.OutputIndex,
}

ctxt, _ = context.WithTimeout(ctxb, timeout)
ctxt, _ = context.WithTimeout(ctxb, time.Duration(time.Second * 15))
Contributor

nit: timeout is 15 seconds so no need to modify this line.

Contributor Author

done

@@ -238,7 +238,15 @@ func (n *NetworkHarness) TearDownAll() error {
// current instance of the network harness. The created node is running, but
// not yet connected to other nodes within the network.
func (n *NetworkHarness) NewNode(name string, extraArgs []string) (*HarnessNode, error) {
return n.newNode(name, extraArgs, false)
node, err := n.newNode(name, extraArgs, false)

Contributor

nit: extra newline.

Contributor Author

done

@@ -46,7 +46,12 @@ func (c *ReplyChannelRange) Decode(r io.Reader, pver uint32) error {
return err
}

c.EncodingType, c.ShortChanIDs, err = decodeShortChanIDs(r)
// special handling to avoid a deep-compare error
Contributor

This has been fixed in the latest master.

Contributor Author

Nope, this was only partially fixed (for line 669).

If you add
numChanIDs = 0

after line 701 in lnwire_test.go (so you force the random number to be zero) to make it:

			numChanIDs := rand.Int31n(5000)
			numChanIDs = 0

			req.ShortChanIDs = make([]ShortChannelID, numChanIDs)

and run the test, you will see the problem.
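The underlying Go behavior, as a self-contained demo (uint64 stands in for ShortChannelID to keep the example minimal):

    package main

    import (
        "fmt"
        "reflect"
    )

    // A nil slice and an allocated-but-empty slice are not reflect.DeepEqual,
    // even though both have length 0. A decoded message with zero ShortChanIDs
    // can therefore compare unequal to one built with make(..., 0).
    func main() {
        var decoded []uint64       // nil: what a decoder may leave unset
        built := make([]uint64, 0) // empty but allocated: what the test builds

        fmt.Println(reflect.DeepEqual(decoded, built)) // false
        fmt.Println(len(decoded) == len(built))        // true
    }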

@@ -254,6 +254,13 @@ func (hn *HarnessNode) Name() string {
return hn.cfg.Name
}

// SetPort can be used to change the P2P port of a node
// TODO (Offer): remove once issue 1496 is resolved
func (hn *HarnessNode) SetPort(port int) {
Contributor

I think it would be best to wait until #1496 is resolved rather than adding this temporary hack.

Contributor Author

IMHO it is better to stabilize the testing infrastructure ASAP and remove this hack as part of the solution to #1496. Without this there will still be random errors.

// before the server started.
// TODO (Offer): replace this hack with a field in getInfo that indicates that the server started

func ensureServerStarted(node *HarnessNode) error {
Contributor

If synced_with_chain is true within the getinfo response, then we can most likely guarantee that the server has started.

Contributor Author

If synced_with_chain is true within the getinfo response, then we can most likely guarantee that the server has started.

But if we are wrong in our assumption and the server hasn't started yet, we will have an error.

synced_with_chain is turned on before the server starts, so there is a gap in between that can cause problems.
For example, if any test calls DisconnectPeer there will be a failure, since DisconnectPeer starts by checking !r.server.Started().

This is one of the reasons why integration tests fail randomly.

Another example is trying to stop the server based on synced_with_chain before the server actually started. This can lead to a deadlock.
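Schematically, the gate in question looks like this (mirroring the rpcserver pattern mentioned above; the error text is the one quoted in the PR description, the rest is a sketch):

    func (r *rpcServer) DisconnectPeer(ctx context.Context,
        in *lnrpc.DisconnectPeerRequest) (*lnrpc.DisconnectPeerResponse, error) {

        if !r.server.Started() {
            // A test that trusted synced_with_chain and called in during
            // the gap lands here and fails.
            return nil, fmt.Errorf("chain backend is still syncing, server " +
                "not active yet")
        }

        // ... normal disconnect handling ...
        return &lnrpc.DisconnectPeerResponse{}, nil
    }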

// TODO (offer): temporarily disabled parallel execution to avoid impact on other tests that use sleep(x) to sync execution
// TODO (offer): this test creates ~966 parallel goroutines, which puts a lot of CPU pressure. The test itself is OK but the impact on other tests is huge
// TODO (offer): the need for such a high number of parallel goroutines should be discussed
//t.Parallel()
Contributor

I'm fine with removing t.Parallel() for this use case.

@@ -777,6 +777,10 @@ func (s *server) Start() error {
srvrLog.Infof("Auto peer bootstrapping is disabled")
}

if !atomic.CompareAndSwapInt32(&s.started, 0, 1) {
Contributor

What's the reason for this change?

Contributor Author

This is needed in order to avoid a race between stop() and start() of the server (and other calls which assume that the server has started when s.started is 1).
The current code changes s.started from zero to one at the beginning of the Start function, which opens the door to races.

By moving it to the end, we can be sure that integration tests will not progress until the server has actually completed the startup procedure.
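In outline (startup body elided; this is a sketch of the change, not the exact diff):

    func (s *server) Start() error {
        // ... start listeners, connect to peers, peer bootstrapping ...

        // Flip the flag only after all startup work has succeeded. A
        // concurrent Stop(), or any RPC gated on Started(), can no longer
        // observe a half-started server.
        if !atomic.CompareAndSwapInt32(&s.started, 0, 1) {
            return nil
        }

        return nil
    }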


// TestLogTicker verifies that the logTicker ticked by waiting 11 seconds after the start of the switch;
// without this, the code handling the logTicker is never called and there is a drop in coverage stats (build failure)
func TestLogTicker(t *testing.T) {
Contributor

The coverage stats have been deemed a bit unreliable for some time now, so not sure if this is something we should worry about.

Contributor Author

IMHO, the coverage issues should be fixed or the GitHub integration with Coveralls should be removed.
The core development team may know to ignore the coverage stats, but other community developers may not, and may spend time and effort trying to resolve a coverage "error".
Moreover, due to this issue, a submitted PR may be flagged with a green checkmark on one run and with a red X on the following run, when there was no actual change between the builds.
I believe I found the main reason why the coverage data is not reliable, and I see no reason why this should not be fixed. Please reconsider.

Contributor Author

@offerm offerm left a comment

@wpaulino thanks for the comments. Please see my responses. Will update the branch soon.


When starting btcd as part of the tests, there is a 10-second timeout that can't be controlled from the lnd files.
This change patches the btcd source code to increase the timeout so Travis will not fail with
"unable to set up mining node: connection timeout"
@offerm
Contributor Author

offerm commented Jul 10, 2018

@wpaulino @Roasbeef Any progress with this PR?

@halseth
Contributor

halseth commented Jul 10, 2018

Is this PR fixing several bugs? In its current state it is very hard to review.

It should be broken up into smaller PRs that each fix one thing at a time, and the commit history should be cleaned up to make it easier to review and to explain exactly what is being fixed.

@offerm
Contributor Author

offerm commented Jul 10, 2018

@halseth This PR solves many issues with the build process. You need all of them to be sure that the Travis/Coveralls builds work without an issue.

I can break it down into several PRs, but I can't be sure that Travis and Coveralls will pass when each change is standalone.

If you agree to review it even if marked as a failure, I will do it.

Alternatively, I can document all these bugs, the reason for them, and how I solved them.

Let me know, please

Offer

@Roasbeef Roasbeef added the P2 (should be fixed if one has time), tests, travis (Modifications to the Travis CI system), needs review (PR needs review by regular contributors) and needs testing (PR hasn't yet been actively tested on testnet/mainnet) labels Jul 11, 2018
@offerm
Contributor Author

offerm commented Jul 18, 2018

@halseth ?

@halseth
Contributor

halseth commented Jul 19, 2018

@offerm We should definitely make each change standalone. If they are all fixing existing errors in master, it shouldn't make the build more likely to fail.

@offerm
Contributor Author

offerm commented Jul 20, 2018

@halseth Created 8 different PRs for these changes so it will be super easy to understand.
These are #1591 #1592 #1593 #1594 #1590 #1589 #1588 #1587

As you can see, almost all of these PRs had Travis errors which are not related at all to the change done in the PR that failed, and which are fixed by one of the other PRs.

Priority (based on likelihood of showing up):
1: #1588 #1589 #1590 #1591
2: #1594
3: #1592
4: #1587 #1593

@halseth
Contributor

halseth commented Jul 20, 2018

@offerm Thanks! Should be much easier to review now :)

Closing, and moving the discussion to the relevant PRs.

@halseth halseth closed this Jul 20, 2018
@offerm
Contributor Author

offerm commented Jul 24, 2018

@halseth are these 8 PRs going to be reviewed? So far only one of them has received attention.

@halseth
Contributor

halseth commented Jul 25, 2018

They will be reviewed eventually :) The complex ones will take longer to review, obviously.

Labels
needs review: PR needs review by regular contributors
needs testing: PR hasn't yet been actively tested on testnet/mainnet
P2: should be fixed if one has time
tests
travis: Modifications to the Travis CI system