
Travis - Fix integration tests random errors #1493

Closed
wants to merge 8 commits into from

Conversation

offerm
Contributor

@offerm offerm commented Jul 1, 2018

Prevents Travis random failures (race)
When running Travis integration tests, sometimes there are random errors:

  1. Nodes fail to stop.
  2. A node can't execute an RPC call since the server is not running yet.
  3. Issue #1496 (Missing channel announcement).

The reason for the second error is a race between LND and the test. LND opens the RPC port before the server is up. This may cause the node.start() call to complete before LND's server is running. In some cases, there are follow-up RPC calls (like listpeers) which may get an error if the server is not ready.

The reason for the first problem is a race between LND startup and the RPC stop call. Since the RPC port is open before server startup is done, a test may send a stop request (via RPC) before startup completes. This may lead to a situation where the LND process will not stop.

This fix handles the problems by the following changes:

  1. Ensure the completion of server startup. This is done by calling DisconnectPeer with an invalid pubKey and checking the result. DisconnectPeer returns an error ("chain backend is still syncing") until the server is marked started. A sketch of this polling loop follows below.

  2. Remove the t.parallel() from TestChannelLinkBidirectionalOneHopePayments. This test is CPU intensive and creates ~966 goroutines. While the test itself is OK, it impacts other tests that are running in parallel. In these other tests we can find sleep(time) calls that are used for sync. Under this load the sleep(time) may be too short, which can cause these tests to fail.

  3. Properly set timeouts (contexts) in tests.

  4. Temporary fix for issue #1496.

  5. Indicate "server started" at the end of the Start() function rather than at its beginning.

As an implication of this fix, we can also remove the special handling in lnd.go for using simnet (see diff). This makes the integration tests more realistic for testnet and mainnet.

Issue #1496 is important and may cause errors in other places too.
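For concreteness, here is a minimal sketch of the polling loop from item 1, in the style of the lntest harness (HarnessNode embeds lnrpc.LightningClient); the retry count and interval here are illustrative assumptions, not the PR's exact values:

    import (
        "context"
        "fmt"
        "strings"
        "time"

        "github.com/lightningnetwork/lnd/lnrpc"
    )

    // ensureServerStarted polls DisconnectPeer with a bogus pubkey until the
    // "still syncing" error disappears, which signals that the server has
    // marked itself started.
    func ensureServerStarted(node *HarnessNode) error {
        req := &lnrpc.DisconnectPeerRequest{PubKey: "invalid-pubkey"}

        for i := 0; i < 50; i++ {
            ctxt, cancel := context.WithTimeout(context.Background(), 5*time.Second)
            _, err := node.DisconnectPeer(ctxt, req)
            cancel()

            // Once started, DisconnectPeer rejects the bogus pubkey with a
            // different error; only "still syncing" means startup is pending.
            if err == nil || !strings.Contains(err.Error(), "still syncing") {
                return nil
            }
            time.Sleep(200 * time.Millisecond)
        }
        return fmt.Errorf("server did not report started in time")
    }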

@offerm
Contributor Author

offerm commented Jul 1, 2018

@Roasbeef If this PR passes all checks, it is ready to be merged.

Let me know if you have any comments.

Offer

@offerm
Contributor Author

offerm commented Jul 3, 2018

@Roasbeef This is ready as far as I'm concerned.

Offer

@Roasbeef Roasbeef requested a review from wpaulino July 4, 2018 03:51
offerm and others added 6 commits July 5, 2018 01:04
Prevents Travis random failures (race)
Overcome a problem with reflect.DeepEqual, which treats an empty slice and an unallocated (nil) slice as different; this causes a unit test to fail from time to time when the number of ShortChanIDs is 0 (selected randomly between 0 and 5000)
Implement error checking in Makefile
Prevent errors (failures) when running without tags="debug"
Overcome a problem with reflect.DeepEqual, which treats an empty slice and an unallocated (nil) slice as different; this causes a unit test to fail from time to time when the number of ShortChanIDs is 0 (selected randomly between 0 and 5000)

(cherry picked from commit eca7586)
TestLogTicker verifies that the logTicker ticked by waiting 11 seconds after the start of the switch; without this, the code handling the logTicker is never called and there is a drop in coverage stats (build failure)
Verify that the logTicker and fwdEventTicker ticked by waiting 16 seconds after the start of the switch; without this, the code handling those tickers is never called and there is a drop in coverage stats (build failure)
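A sketch of the shape of these ticker tests (newTestSwitch is a hypothetical helper, and the interval is inferred from the 11-second wait):

    // TestLogTicker blocks for longer than the logTicker interval so the
    // ticker branch of the switch's event loop actually runs and is counted
    // by the coverage tooling.
    func TestLogTicker(t *testing.T) {
        s := newTestSwitch(t) // hypothetical: builds and starts a Switch
        defer s.Stop()

        // The logTicker fires roughly every 10 seconds; sleeping 11 seconds
        // guarantees at least one tick is processed before the test ends.
        time.Sleep(11 * time.Second)
    }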
@@ -1686,6 +1689,9 @@ func updateState(batchTick chan time.Time, link *channelLink,
// sleep in this test and the one below
func TestChannelLinkBandwidthConsistency(t *testing.T) {
t.Parallel()
if !hodl.DebugBuild {
Contributor

👍

lnd.go Outdated
@@ -568,45 +568,43 @@ func lndMain() error {
}()
}

// If we're not in simnet mode, We'll wait until we're fully synced to
// We'll wait until we're fully synced to
Contributor

nit: expand comment to 80 chars per line.

Contributor Author

Done

lnd_test.go Outdated
@@ -8759,7 +8774,7 @@ func testSwitchOfflineDelivery(net *lntest.NetworkHarness, t *harnessTest) {
Index: chanPoint.OutputIndex,
}

ctxt, _ = context.WithTimeout(ctxb, timeout)
ctxt, _ = context.WithTimeout(ctxb, time.Duration(time.Second * 15))
Contributor

nit: timeout is 15 seconds so no need to modify this line.

Contributor Author

done

@@ -238,7 +238,15 @@ func (n *NetworkHarness) TearDownAll() error {
// current instance of the network harness. The created node is running, but
// not yet connected to other nodes within the network.
func (n *NetworkHarness) NewNode(name string, extraArgs []string) (*HarnessNode, error) {
return n.newNode(name, extraArgs, false)
node, err := n.newNode(name, extraArgs, false)

Contributor

nit: extra newline.

Contributor Author

done

@@ -46,7 +46,12 @@ func (c *ReplyChannelRange) Decode(r io.Reader, pver uint32) error {
return err
}

c.EncodingType, c.ShortChanIDs, err = decodeShortChanIDs(r)
// special handling to avoid a deep-compare error
Contributor

This has been fixed in the latest master.

Contributor Author

Nope, this was only partially fixed (for line 669).

If you add
numChanIDs = 0

after line 701 in lnwire_test.go (so you force the random number to be zero) to make it:

			numChanIDs := rand.Int31n(5000)
			numChanIDs = 0

			req.ShortChanIDs = make([]ShortChannelID, numChanIDs)

and run the test, you will see the problem.
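The underlying Go behavior, as a self-contained demo (uint64 stands in for ShortChannelID to keep the example minimal):

    package main

    import (
        "fmt"
        "reflect"
    )

    // A nil slice and an allocated-but-empty slice are not reflect.DeepEqual,
    // even though both have length 0. A decoded message with zero ShortChanIDs
    // can therefore compare unequal to one built with make(..., 0).
    func main() {
        var decoded []uint64       // nil: what a decoder may leave unset
        built := make([]uint64, 0) // empty but allocated: what the test builds

        fmt.Println(reflect.DeepEqual(decoded, built)) // false
        fmt.Println(len(decoded) == len(built))        // true
    }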

@@ -254,6 +254,13 @@ func (hn *HarnessNode) Name() string {
return hn.cfg.Name
}

// SetPort can be used to change the P2P port of a node
// TODO (Offer): remove once issue 1496 is resolved
func (hn *HarnessNode) SetPort(port int) {
Contributor

I think it would be best to wait until #1496 is resolved rather than adding this temporary hack.

Contributor Author

IMHO it is better to stabilize the testing infrastructure ASAP and remove this hack as part of the solution to #1496. Without this there will still be random errors.

// before the server started.
// TODO (Offer): replace this hack with a field in getInfo that indicates that the server started

func ensureServerStarted(node *HarnessNode) error {
Contributor

If synced_with_chain is true within the getinfo response, then we can most likely guarantee that the server has started.

Contributor Author

If synced_with_chain is true within the getinfo response, then we can most likely guarantee that the server has started.

But if we are wrong in our assumption and the server hasn't started yet, we will have an error.

synced_with_chain is turned on before the server starts, so there is a gap in between that can cause problems.
For example, if any test calls DisconnectPeer there will be a failure, since DisconnectPeer starts by checking !r.server.Started().

This is one of the reasons why integration tests fail randomly.

Another example is trying to stop the server based on synced_with_chain before the server actually started. This can lead to a deadlock.
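Schematically, the gate in question looks like this (mirroring the rpcserver pattern mentioned above; the error text is the one quoted in the PR description, the rest is a sketch):

    func (r *rpcServer) DisconnectPeer(ctx context.Context,
        in *lnrpc.DisconnectPeerRequest) (*lnrpc.DisconnectPeerResponse, error) {

        if !r.server.Started() {
            // A test that trusted synced_with_chain and called in during
            // the gap lands here and fails.
            return nil, fmt.Errorf("chain backend is still syncing, server " +
                "not active yet")
        }

        // ... normal disconnect handling ...
        return &lnrpc.DisconnectPeerResponse{}, nil
    }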

// TODO (offer): temporarily disabled parallel execution to avoid impact on other tests that use sleep(x) to sync execution
// TODO (offer): this test creates ~966 parallel goroutines, which puts a lot of CPU pressure. The test itself is OK but the impact on other tests is huge
// TODO (offer): the need for such a high number of parallel goroutines should be discussed
//t.Parallel()
Contributor

I'm fine with removing t.Parallel() for this use case.

@@ -777,6 +777,10 @@ func (s *server) Start() error {
srvrLog.Infof("Auto peer bootstrapping is disabled")
}

if !atomic.CompareAndSwapInt32(&s.started, 0, 1) {
Contributor

What's the reason for this change?

Contributor Author

This is needed in order to avoid a race between stop() and start() of the server (and other calls which assume that the server has started when s.started is 1).
The current code changes s.started from zero to one at the beginning of the Start function, which opens the door to races.

By moving it to the end, we can be sure that integration tests will not progress until the server has actually completed the startup procedure.
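In outline (startup body elided; this is a sketch of the change, not the exact diff):

    func (s *server) Start() error {
        // ... start listeners, connect to peers, peer bootstrapping ...

        // Flip the flag only after all startup work has succeeded. A
        // concurrent Stop(), or any RPC gated on Started(), can no longer
        // observe a half-started server.
        if !atomic.CompareAndSwapInt32(&s.started, 0, 1) {
            return nil
        }

        return nil
    }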


// TestLogTicker verifies that the logTicker ticked by waiting 11 seconds after the start of the switch;
// without this, the code handling the logTicker is never called and there is a drop in coverage stats (build failure)
func TestLogTicker(t *testing.T) {
Contributor

The coverage stats have been deemed a bit unreliable for some time now, so not sure if this is something we should worry about.

Contributor Author

IMHO, the coverage issues should be fixed or the GitHub integration with Coveralls should be removed.
The core development team may know to ignore the coverage stats, but other community developers may not, and may spend time and effort trying to resolve a coverage "error".
Moreover, due to this issue, a submitted PR may be flagged with a green checkmark on one run and with a red X on the following run, when there was no actual change between the builds.
I believe I found the main reason why the coverage data is not reliable, and I see no reason why this should not be fixed. Please reconsider.

Contributor Author

@offerm offerm left a comment

@wpaulino thanks for the comments. Please see my responses. Will update the branch soon.


When starting btcd as part of the tests, there is a 10-second timeout that can't be controlled from the lnd files.
This change patches the btcd source code to increase the timeout so Travis will not fail with
"unable to set up mining node: connection timeout"
@offerm
Contributor Author

offerm commented Jul 10, 2018

@wpaulino @Roasbeef Any progress with this PR?

@halseth
Contributor

halseth commented Jul 10, 2018

Is this PR fixing several bugs? In its current state it is very hard to review.

It should be broken up into smaller PRs that each fix one thing at a time, and the commit history should be cleaned up to make it easier to review and to explain exactly what is being fixed.

@offerm
Contributor Author

offerm commented Jul 10, 2018

@halseth This PR solves many issues with the build process. You need all of them to be sure that the Travis/Coveralls builds work without an issue.

I can break it down into several PRs, but I can't be sure that Travis and Coveralls will pass when each change is standalone.

If you agree to review it even if marked as a failure, I will do it.

Alternatively, I can document all these bugs, the reason for them, and how I solved them.

Let me know, please

Offer

@Roasbeef Roasbeef added the P2 (should be fixed if one has time), tests, travis (Modifications to the Travis CI system), needs review (PR needs review by regular contributors) and needs testing (PR hasn't yet been actively tested on testnet/mainnet) labels Jul 11, 2018
@offerm
Contributor Author

offerm commented Jul 18, 2018

@halseth ?

@halseth
Contributor

halseth commented Jul 19, 2018

@offerm We should definitely make each change standalone. If they are all fixing existing errors in master, it shouldn't make the build more likely to fail.

@offerm
Contributor Author

offerm commented Jul 20, 2018

@halseth Created 8 different PRs for these changes so it will be super easy to understand.
These are #1591 #1592 #1593 #1594 #1590 #1589 #1588 #1587

As you can see, almost all of these PRs had Travis errors which are not related at all to the change done in the PR that failed, and which are fixed by one of the other PRs.

Priority (based on likelihood of showing up):
1: #1588 #1589 #1590 #1591
2: #1594
3: #1592
4: #1587 #1593

@halseth
Contributor

halseth commented Jul 20, 2018

@offerm Thanks! Should be much easier to review now :)

Closing, and moving the discussion to the relevant PRs.

@halseth halseth closed this Jul 20, 2018
@offerm
Contributor Author

offerm commented Jul 24, 2018

@halseth are these 8 PRs going to be reviewed? So far only one of them has received attention.

@halseth
Contributor

halseth commented Jul 25, 2018

They will be reviewed eventually :) The complex ones will take longer to review, obviously.

Labels
needs review: PR needs review by regular contributors
needs testing: PR hasn't yet been actively tested on testnet/mainnet
P2: should be fixed if one has time
tests
travis: Modifications to the Travis CI system