Travis - Fix integration tests random errors #1493
Conversation
@Roasbeef If this PR passes all checks, it is ready to be merged. Let me know if you have any comments. Offer
@Roasbeef This is ready as far as I'm concerned. Offer
When running Travis integration tests, sometimes there are random errors:

1. Nodes fail to stop.
2. A node can't execute an RPC call since the server is not running yet.
3. Issue #1496.

The reason for the second error is a race between LND and the test. LND opens the RPC port before the server is up. This may cause the node.start() call to complete before LND's server is running. In some cases, there are follow-up RPC calls (like listpeers) which may get an error if the server is not ready.

The reason for the first problem is a race between LND startup and the RPC stop call. Since the RPC port is open before server startup is done, a test may send a stop request (via RPC) before startup completes. This may lead to a situation where the LND process will not stop.

This fix handles the problems with the following changes:

1. Ensure the completion of server startup. This is done by calling DisconnectPeer with an invalid pubKey and checking the result: DisconnectPeer returns an error ("chain backend is still syncing") until the server is marked started. (A sketch of this helper follows below.)
2. Remove t.Parallel() from TestChannelLinkBidirectionalOneHopePayments. This test is CPU intensive and creates ~966 goroutines. While the test itself is OK, it impacts other tests running in parallel: those tests use sleep(time) for synchronization, and under this load the sleeps may be too short, causing them to fail.
3. Properly set timeouts (contexts) in tests.
4. Temporary fix for issue #1496.
5. Indicate "server started" at the end of the Start() function and not at the start of it.

As an implication of this fix, we can also remove the special handling in lnd.go for using simnet (see diff). This makes the integration tests more realistic for testnet and mainnet.

Issue #1496 is important and may cause errors in other places too.
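A minimal sketch of how fix 1 can work, based on the description above. The helper shape, retry interval, and deadline are assumptions; only the DisconnectPeer probe with an invalid pubkey and the "chain backend is still syncing" error string come from the PR itself:

```go
// ensureServerStarted blocks until the node's main server has finished
// starting. It issues DisconnectPeer with an invalid pubkey: lnd answers
// "chain backend is still syncing" until the server is marked started,
// and anything else once it is up.
//
// Sketch only: HarnessNode and its embedded lnrpc client are assumed
// from lntest; timings are arbitrary.
func ensureServerStarted(node *HarnessNode) error {
	req := &lnrpc.DisconnectPeerRequest{PubKey: "invalid-pubkey"}

	deadline := time.Now().Add(30 * time.Second)
	for time.Now().Before(deadline) {
		ctxt, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		_, err := node.DisconnectPeer(ctxt, req)
		cancel()

		// Any reply other than the "still syncing" sentinel means
		// Start() has completed (the invalid pubkey itself will
		// produce a different error).
		if err == nil || !strings.Contains(err.Error(), "still syncing") {
			return nil
		}

		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("server did not start before deadline")
}
```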
Overcome a problem with reflect.DeepEqual, which treats an empty array and an unallocated array as different. This causes the unit test to fail from time to time when the number of ShortChanIDs is 0 (selected randomly between 0 and 5000).
Implement error checking in Makefile. Prevents failures when running without tags="debug".
Overcome a problem with reflect.DeepEqual, which treats an empty array and an unallocated array as different. This causes the unit test to fail from time to time when the number of ShortChanIDs is 0 (selected randomly between 0 and 5000). (cherry picked from commit eca7586)
TestLogTicker verifies that the logTicker ticked by waiting 11 seconds after the start of the switch. Without this, the code handling the logTicker will not be called and there will be a drop in coverage stats (build fail).
Verify that the logTicker and fwdEventTicker ticked by waiting 16 seconds after the start of the switch. Without this, the code handling these tickers will not be called and there will be a drop in coverage stats (build fail).
@@ -1686,6 +1689,9 @@ func updateState(batchTick chan time.Time, link *channelLink,
	// sleep in this test and the one below
	func TestChannelLinkBandwidthConsistency(t *testing.T) {
	t.Parallel()
+	if !hodl.DebugBuild {
👍
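For context, a sketch of how this guard presumably reads in full, assuming hodl.DebugBuild is a boolean that is only true when lnd is built with -tags debug:

```go
func TestChannelLinkBandwidthConsistency(t *testing.T) {
	t.Parallel()

	// The test drives hodl behaviour that only exists in debug
	// builds, so refuse to run (rather than silently pass) without it.
	if !hodl.DebugBuild {
		t.Fatalf("test must be run with '-tags debug'")
	}

	// ... remainder of the bandwidth consistency test ...
}
```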
lnd.go
Outdated
@@ -568,45 +568,43 @@ func lndMain() error {
	}()
	}

-	// If we're not in simnet mode, We'll wait until we're fully synced to
+	// We'll wait until we're fully synced to
nit: expand comment to 80 chars per line.
Done
lnd_test.go
Outdated
@@ -8759,7 +8774,7 @@ func testSwitchOfflineDelivery(net *lntest.NetworkHarness, t *harnessTest) {
	Index: chanPoint.OutputIndex,
	}

-	ctxt, _ = context.WithTimeout(ctxb, timeout)
+	ctxt, _ = context.WithTimeout(ctxb, time.Duration(time.Second * 15))
nit: timeout is 15 seconds, so no need to modify this line.
done
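For reference, the pattern the nit refers to; a hedged sketch where timeout is the harness-wide 15-second constant and ListPeers stands in for any follow-up RPC:

```go
// Bound each RPC with the shared harness timeout rather than an inline
// literal; calling cancel releases the context's resources promptly
// (the existing tests discard it with the blank identifier).
ctxt, cancel := context.WithTimeout(ctxb, timeout)
defer cancel()
if _, err := net.Alice.ListPeers(ctxt, &lnrpc.ListPeersRequest{}); err != nil {
	t.Fatalf("unable to list peers: %v", err)
}
```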
lntest/harness.go
Outdated
@@ -238,7 +238,15 @@ func (n *NetworkHarness) TearDownAll() error {
	// current instance of the network harness. The created node is running, but
	// not yet connected to other nodes within the network.
	func (n *NetworkHarness) NewNode(name string, extraArgs []string) (*HarnessNode, error) {
-	return n.newNode(name, extraArgs, false)
+	node, err := n.newNode(name, extraArgs, false)
nit: extra newline.
done
@@ -46,7 +46,12 @@ func (c *ReplyChannelRange) Decode(r io.Reader, pver uint32) error {
	return err
	}

	c.EncodingType, c.ShortChanIDs, err = decodeShortChanIDs(r)
+	// special handling to avoid error deep compare
This has been fixed in the latest master.
Nope, this was only partially fixed (for line 669). If you add numChanIDs = 0 after line 701 in lnwire_test.go (so you force the random number to be zero), making it:

numChanIDs := rand.Int31n(5000)
numChanIDs = 0
req.ShortChanIDs = make([]ShortChannelID, numChanIDs)

and run the test, you will see the problem.
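The reflect.DeepEqual behaviour at the root of this is easy to reproduce in isolation: a nil slice (what a decoder may leave untouched when there are zero IDs) and an allocated-but-empty slice (what the test built) compare as unequal:

```go
package main

import (
	"fmt"
	"reflect"
)

func main() {
	var decoded []uint64         // nil slice: decoder never allocated it
	encoded := make([]uint64, 0) // empty but allocated slice

	fmt.Println(reflect.DeepEqual(decoded, encoded)) // false
	fmt.Println(len(decoded) == len(encoded))        // true: same contents
}
```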
@@ -254,6 +254,13 @@ func (hn *HarnessNode) Name() string {
	return hn.cfg.Name
	}

+	// SetPort can be used to change P2P port of a node
+	// TODO (Offer): remove once issue 1496 is resolved
+	func (hn *HarnessNode) SetPort(port int) {
I think it would be best to wait until #1496 is resolved rather than adding this temporary hack.
IMHO it is better to stabilize the testing infrastructure ASAP and remove this hack as part of the solution to #1496. Without this there will still be random errors.
// before the server started.
// TODO (Offer): replace this hack with a field in getInfo that indicates that the server started

func ensureServerStarted(node *HarnessNode) error {
If synced_with_chain is true within the getinfo response, then we can most likely guarantee that the server has started.
If synced_with_chain is true within the getinfo response, then we can most likely guarantee that the server has started.

But if we are wrong with our assumption and the server didn't start yet, we will have an error. The synced_with_chain flag is turned on before the server starts; there is a gap in between that can cause problems.
For example, if any test calls DisconnectPeer, there will be a failure, since DisconnectPeer starts by checking if !r.server.Started(). This is one of the reasons why integration tests fail randomly.
Another example is trying to stop the server based on synced_with_chain before the server actually started. This can lead to a deadlock.
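In sketch form, the guard being referred to. The exact body of DisconnectPeer in rpcserver.go is assumed from the comments above, but the shape is enough to show why any call in that gap fails:

```go
// Hypothetical shape of the RPC entry point per the discussion: the
// request is rejected outright until the server's started flag is set,
// even though synced_with_chain may already report true.
func (r *rpcServer) DisconnectPeer(ctx context.Context,
	in *lnrpc.DisconnectPeerRequest) (*lnrpc.DisconnectPeerResponse, error) {

	if !r.server.Started() {
		return nil, fmt.Errorf("chain backend is still syncing, server " +
			"not active yet")
	}

	// ... actual peer disconnect logic ...
	return &lnrpc.DisconnectPeerResponse{}, nil
}
```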
// TODO (offer): temporarily disabled parallel execution to avoid impact on other tests that are using sleep(x) to sync execution
// TODO (offer): this test creates 966 parallel goroutines, which puts a lot of CPU pressure. The test itself is OK but the impact on other tests is huge
// TODO (offer): the need for such a high number of parallel routines should be discussed
//t.Parallel()
I'm fine with removing t.Parallel() for this use case.
@@ -777,6 +777,10 @@ func (s *server) Start() error {
	srvrLog.Infof("Auto peer bootstrapping is disabled")
	}

+	if !atomic.CompareAndSwapInt32(&s.started, 0, 1) {
What's the reason for this change?
This is needed in order to avoid a race between stop() and start() of the server (and other calls which assume that the server started when s.started is 1). The current code changes s.started from zero to one at the beginning of the start function, which opens the door to races. By moving it to the end, we can be sure that integration tests will not progress until the server has actually completed the startup procedure.
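A sketch of the change being defended, with the field and method names taken from the diff and everything else assumed: the compare-and-swap that publishes s.started moves from the top of Start() to the bottom, so anything gating on Started() keeps refusing until startup fully completes:

```go
func (s *server) Start() error {
	// ... bring up every subsystem first: listeners, chain backend,
	// htlcswitch, peer bootstrapper, and so on ...

	// Publish the started flag only once everything above succeeded.
	// Stop() and RPC handlers that check s.Started() will now keep
	// refusing until startup has genuinely finished, closing the
	// start/stop race described above.
	if !atomic.CompareAndSwapInt32(&s.started, 0, 1) {
		return nil // already started by a concurrent caller
	}

	return nil
}

// Started reports whether Start() has run to completion.
func (s *server) Started() bool {
	return atomic.LoadInt32(&s.started) == 1
}
```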
// TestLogTicker verifies that the logTicker ticked by waiting 11 seconds after the start of the switch;
// without this, the code handling the logTicker will not be called and there will be a drop in coverage stats (build fail)
func TestLogTicker(t *testing.T) {
The coverage stats have been deemed a bit unreliable for some time now, so not sure if this is something we should worry about.
IMHO, the coverage issues should be fixed or the GitHub integration with Coveralls should be removed.
The core development team may know to ignore coverage stats, but other community developers may not, and may spend time and effort trying to resolve a coverage "error".
Moreover, due to this issue, a submitted PR may be flagged with a green checkmark on one run and a red X on the following run, when there was no actual change between the builds.
I believe I found the main reason why the coverage data is not reliable, and I see no reason why this should not be fixed. Please reconsider.
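As a self-contained illustration of the technique these commits use (with an assumed, simplified stand-in for the switch's logTicker loop): the test simply outlives the ticker interval, so the tick branch executes and is counted by the coverage run:

```go
package main

import (
	"testing"
	"time"
)

// runWithTicker is a stand-in for the switch's internal loop: every
// interval it invokes logStats, until quit is closed.
func runWithTicker(interval time.Duration, quit <-chan struct{}, logStats func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			logStats() // the branch coverage needs to see executed
		case <-quit:
			return
		}
	}
}

func TestLogTicker(t *testing.T) {
	ticked := make(chan struct{}, 1)
	quit := make(chan struct{})
	go runWithTicker(100*time.Millisecond, quit, func() {
		select {
		case ticked <- struct{}{}:
		default:
		}
	})

	// Sleep past the interval (the real tests wait 11s and 16s for
	// 10s and 15s tickers) so the handler fires before shutdown.
	time.Sleep(250 * time.Millisecond)
	close(quit)

	select {
	case <-ticked:
	default:
		t.Fatal("logTicker never fired")
	}
}
```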
@wpaulino thanks for the comments. Please see my responses. Will update the branch soon.
When starting btcd as part of the tests, there is a 10-second timeout that can't be controlled from lnd files. This change patches the btcd source code to increase the timeout so Travis will not fail with "unable to set up mining node: connection timeout".
Is this PR fixing several bugs? In its current state it is very hard to review. It should be broken up into smaller PRs that fix one thing at a time, and the commit history should be cleaned up to make it easier to review and to explain what is being fixed exactly.
@halseth This PR solves many issues with the build process. You need all of them to be sure that Travis/Coveralls builds work without an issue. I can break it down into several PRs, but I can't be sure that Travis and Coveralls will pass when each change is standalone. If you agree to review them even if marked as failures, I will do it. Alternatively, I can document all these bugs, the reasons for them, and how I solved them. Let me know, please. Offer
@halseth ?
@offerm We should definitely make each change standalone. If they are all fixing existing errors in master, it shouldn't make the build more likely to fail.
@halseth Created 8 different PRs for these changes so it will be super easy to understand. As you can see, almost all of these PRs had Travis errors which are not related at all to the change done in the PR that failed, and are fixed by one of the other PRs. Priority (based on likelihood of showing up):
@offerm Thanks! Should be much easier to review now :) Closing, and moving the discussion to the relevant PRs.
@halseth are these 8 PRs going to be reviewed? So far only one of them got attention.
They will be reviewed eventually :) The complex ones will take longer to review, obviously.