Fix (and test) threaded payment retries #2009
Conversation
New test is failing.
Force-pushed from c54dcbf to c58938b (compare).
Oops, new test was a bit racy itself. Should be good now.
Codecov Report — Base: 87.25% // Head: 87.87% // Increases project coverage by +0.62%.

@@           Coverage Diff            @@
##            main    #2009     +/-  ##
==========================================
+ Coverage   87.25%   87.87%   +0.62%
==========================================
  Files         100      101       +1
  Lines       44110    47662    +3552
  Branches    44110    47662    +3552
==========================================
+ Hits        38488    41885    +3397
- Misses       5622     5777     +155

View full report at Codecov.
Grr, hit `thread 'ln::payment_tests::test_threaded_payment_retries' panicked at 'assertion failed: peer.try_lock().is_ok()', lightning/src/ln/channelmanager.rs:3732:17` which means I have to update the lockorder tests again...
Force-pushed from c58938b to db8ace2 (compare).
Done, hopefully it passes now, had to add two more commits upfront, though.
All CI failures were due to crates.io being down, I kicked them.
pub(crate) enum LockHeldState {
	HeldByThread,
	NotHeldByThread,
	Unsupported,
}
I think a comment explaining when a lock held state cannot be determined would be helpful
Even better, I cfg-flagged it so it's not even there for test builds.
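To illustrate the cfg-flagging being discussed, here is a hedged sketch (the feature name is hypothetical, not rust-lightning's actual cfg): the `Unsupported` variant is simply compiled out of builds whose lock implementation can track per-thread ownership, so it "isn't even there" in those builds.

```rust
// Sketch only: `no_lock_tracking` is a hypothetical feature flag standing in
// for whatever cfg the real code uses to select the lock implementation.
enum LockHeldState {
	HeldByThread,
	NotHeldByThread,
	#[cfg(feature = "no_lock_tracking")] // compiled out when tracking works
	Unsupported,
}

fn describe(state: &LockHeldState) -> &'static str {
	match state {
		LockHeldState::HeldByThread => "held by this thread",
		LockHeldState::NotHeldByThread => "not held by this thread",
		// The match arm is gated identically, so the match stays exhaustive
		// in both configurations.
		#[cfg(feature = "no_lock_tracking")]
		LockHeldState::Unsupported => "cannot be determined",
	}
}

fn main() {
	assert_eq!(describe(&LockHeldState::HeldByThread), "held by this thread");
	println!("{}", describe(&LockHeldState::NotHeldByThread));
}
```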
First pass, some nits and questions :)
-#[cfg(feature = "std")] // If we put this on the `if`, we get "attributes are not yet allowed on `if` expressions" on 1.41.1
 impl<'a> Drop for TestRouter<'a> {
 	fn drop(&mut self) {
-		if std::thread::panicking() {
-			return;
+		#[cfg(feature = "std")] {
+			if std::thread::panicking() {
How come it's okay to go against this comment?
I added that comment originally to get CI to pass, but this way also works and is better so we can get test coverage on no-std. Jeff pointed out the fix here: #1916 (comment)
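A small self-contained sketch of the trick: rustc 1.41.1 rejects an attribute directly on an `if` expression, but accepts one on a block statement, so the `std`-only code can live inside a cfg-gated inner block instead of cfg-gating the whole `impl`. Here the always-true `all()` cfg stands in for `feature = "std"` so the example compiles on its own.

```rust
// Sketch: #[cfg(all())] is always true and stands in for
// #[cfg(feature = "std")] from the diff above.
fn drop_time_checks() -> bool {
	let mut checks_ran = false;
	// An attribute on a block *statement* is accepted where one on an `if`
	// expression was not (on old compilers).
	#[cfg(all())]
	{
		// e.g. `if std::thread::panicking() { return; }` would go here
		checks_ran = true;
	}
	checks_ran
}

fn main() {
	assert!(drop_time_checks());
	println!("cfg-gated block executed: {}", drop_time_checks());
}
```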
Ohh I didn't even realize it was using an extra scope, cool 👍
@@ -248,6 +262,13 @@ impl<T> Mutex<T> {
 	}
 }

+impl<T> LockTestExt for Mutex<T> {
More of a general question, but how come rust-lightning has its own synchronization primitives like `Mutex` and `RwLock`? From briefly looking at the fields I'm guessing it's to help with development/testing purposes, so I was also wondering if that has any meaningful impact on performance?
Yea, we don't actually use our own custom locks, we always stub out to the std ones, except in two cases:
a) in tests, we wrap them in all kinds of debug testing to ensure we don't have lockorder violations,
b) the FairRwLock provides fairness guarantees that the std RwLock does not (though it's a relatively thin wrapper around RwLock).
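The "stub out to the std ones" pattern can be sketched roughly like this (a minimal illustration, not rust-lightning's actual code): in release builds the custom `Mutex` is a zero-cost passthrough to `std::sync::Mutex`, while test builds would swap in a debug version with lockorder tracking, so production performance is unaffected.

```rust
use std::sync::{LockResult, Mutex as StdMutex, MutexGuard};

// Passthrough wrapper: same shape as the std type, no extra state, so the
// compiler can optimize it away entirely in non-test builds.
pub struct Mutex<T> {
	inner: StdMutex<T>,
}

impl<T> Mutex<T> {
	pub fn new(t: T) -> Mutex<T> {
		Mutex { inner: StdMutex::new(t) }
	}

	pub fn lock(&self) -> LockResult<MutexGuard<'_, T>> {
		// In a test build this is where lockorder bookkeeping would hook in.
		self.inner.lock()
	}
}

fn main() {
	let m = Mutex::new(5);
	*m.lock().unwrap() += 1;
	assert_eq!(*m.lock().unwrap(), 6);
	println!("value = {}", *m.lock().unwrap());
}
```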
Just took a better look at the locks and that makes more sense now 👍
@@ -478,6 +480,7 @@ impl OutboundPayments {
 	FH: Fn() -> Vec<ChannelDetails>,
 	L::Target: Logger,
 {
+	let _single_thread = self.retry_lock.lock().unwrap();
Should we return early if we fail to acquire the lock?
Eh? I mean, we could, but I don't want to bother adding more code than necessary (and we probably always want to at least run once, so we'd have to do the whole rigamarole that peer_handler does). We really shouldn't have two threads calling this anyway, at least as long as we only generate one PendingHTLCsForwardable event.
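The effect of that `retry_lock` guard can be sketched as follows (a simplified stand-in, not rust-lightning's real types or fields): the guard taken at the top of the function is held for the entire retry loop, so a second thread entering retry processing blocks until the first is done and can never compute and send a duplicate retry.

```rust
use std::sync::Mutex;

// Hypothetical, stripped-down stand-in for the real OutboundPayments.
struct OutboundPayments {
	retry_lock: Mutex<()>,
}

impl OutboundPayments {
	fn process_retries(&self, pending_amounts: &mut Vec<u64>) -> u32 {
		// Held for the whole loop: serializes retry processing across threads.
		let _single_thread = self.retry_lock.lock().unwrap();
		let mut sent = 0;
		while let Some(amount) = pending_amounts.pop() {
			// Step 1 (compute the amount to retry) and step 2 (send it) both
			// happen with `_single_thread` held, closing the duplicate-retry
			// race described above.
			let _ = amount;
			sent += 1;
		}
		sent
	} // `_single_thread` dropped here, releasing the lock
}

fn main() {
	let payments = OutboundPayments { retry_lock: Mutex::new(()) };
	let mut pending = vec![1_000, 2_000, 3_000];
	assert_eq!(payments.process_retries(&mut pending), 3);
	assert!(pending.is_empty());
	println!("all retries processed exactly once");
}
```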
Force-pushed from db8ace2 to 00084e2 (compare).
Rebased and addressed comments.
Force-pushed from 00084e2 to 0a5a906 (compare).
LGTM after squash
Force-pushed from 0a5a906 to 77a0f77 (compare).
Squashed and removed the spurious newline.
In anticipation of the next commit(s) adding threaded tests, we need to ensure our lockorder checks work fine with multiple threads. Sadly, currently we have tests in the form `assert!(mutex.try_lock().is_ok())` to assert that a given mutex is not locked by the caller to a function. The fix is rather simple given we already track mutexes locked by a thread in our `debug_sync` logic - simply replace the check with a new extension trait which (for test builds) checks the locked state by only looking at what was locked by the current thread.
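The idea in this commit message can be sketched with a thread-local held-locks set (a hedged illustration of the approach, not rust-lightning's actual `debug_sync` code): each lock records acquisition and release in the current thread's own list, so "is this lock held by *this* thread?" can be answered correctly even when another thread holds the mutex, where `try_lock().is_ok()` would give the wrong answer.

```rust
use std::cell::RefCell;
use std::sync::{Mutex, MutexGuard};

// Per-thread record of which locks this thread currently holds, keyed by
// the lock's address (a simplification; the real code tracks richer state).
thread_local! {
	static LOCKS_HELD: RefCell<Vec<usize>> = RefCell::new(Vec::new());
}

struct TrackedMutex<T> {
	inner: Mutex<T>,
}

impl<T> TrackedMutex<T> {
	fn new(t: T) -> Self {
		TrackedMutex { inner: Mutex::new(t) }
	}

	fn lock(&self) -> TrackedGuard<'_, T> {
		let guard = self.inner.lock().unwrap();
		let id = self as *const _ as usize;
		LOCKS_HELD.with(|h| h.borrow_mut().push(id));
		TrackedGuard { id, _guard: guard }
	}

	// The LockTestExt-style check: consults only this thread's records.
	fn held_by_current_thread(&self) -> bool {
		let id = self as *const _ as usize;
		LOCKS_HELD.with(|h| h.borrow().contains(&id))
	}
}

struct TrackedGuard<'a, T> {
	id: usize,
	_guard: MutexGuard<'a, T>,
}

impl<'a, T> Drop for TrackedGuard<'a, T> {
	fn drop(&mut self) {
		LOCKS_HELD.with(|h| {
			let mut held = h.borrow_mut();
			if let Some(pos) = held.iter().position(|&x| x == self.id) {
				held.remove(pos);
			}
		});
	}
}

fn main() {
	let m = TrackedMutex::new(0u32);
	assert!(!m.held_by_current_thread());
	{
		let _g = m.lock();
		assert!(m.held_by_current_thread());
	}
	assert!(!m.held_by_current_thread());
	println!("per-thread lock tracking works");
}
```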
...rather than only in std.
The new in-`ChannelManager` retries logic does retries as two separate steps, under two separate locks - first it calculates the amount that needs to be retried, then it actually sends it. Because the first step doesn't update the amount, a second thread may come along and calculate the same amount and end up retrying duplicatively. Because we generally shouldn't ever be processing retries at the same time, the fix is trivial - simply take a lock at the top of the retry loop and hold it until we're done.
Force-pushed from 77a0f77 to d986329 (compare).
Removed spurious cfg tag inclusion:
$ git diff-tree -U2 77a0f7746 d98632973
diff --git a/lightning/src/sync/mod.rs b/lightning/src/sync/mod.rs
index bbf3998f7..50ef40e29 100644
--- a/lightning/src/sync/mod.rs
+++ b/lightning/src/sync/mod.rs
@@ -20,7 +20,4 @@ pub use debug_sync::*;
mod test_lockorder_checks;
-#[cfg(all(any(feature = "_bench_unstable", not(test)), feature = "std"))]
-
-
#[cfg(all(feature = "std", any(feature = "_bench_unstable", not(test))))]
pub(crate) mod fairrwlock;
This resolves (I believe) all the pending followups from #1916.