Don't remove nodes if there's no channel_update for a temp failure #2220

Merged
merged 3 commits into from
Apr 24, 2023


TheBlueMatt
Collaborator

Previously, we required any onion error with the UPDATE flag set to include a channel_update, as the spec mandates[1]. If an onion error was missing one, we treated the sender as a misbehaving node that doesn't follow the spec and simply removed the node.

Sadly, it appears at least some versions of CLN are such nodes, and opt to not include channel_update at all if they're returning a temporary_channel_failure. This causes us to completely remove CLN nodes from our graph after they fail to forward our HTLC.

While CLN is violating the spec here, there's little reason not to allow it, so we go ahead and do so here, treating it like any other failure and letting the scorer handle it.

[1] The spec says "Please note that the channel_update field is mandatory in messages whose failure_code includes the UPDATE flag", but it doesn't repeat this in the requirements section, so it's not crazy that someone missed it when implementing.
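
As a rough illustration of the handling described above, the sketch below models only failures whose code carries the UPDATE flag. The `UpdateField` and `Action` types and the `handle_update_failure` function are hypothetical names made up for this example and are not LDK's API. The real code in `onion_utils.rs` also validates the update's contents, which is skipped here. The flag value and the `temporary_channel_failure` code are taken from BOLT 4.

```rust
/// BOLT 4 failure-code flag: the error is expected to carry a `channel_update`.
const UPDATE: u16 = 0x1000;
/// BOLT 4 `temporary_channel_failure`.
const TEMPORARY_CHANNEL_FAILURE: u16 = UPDATE | 7;

/// Hypothetical summary of what we found in the `channel_update` field.
#[derive(Debug, PartialEq)]
enum UpdateField {
    /// Zero-length field, i.e. no update was included at all.
    Missing,
    /// Non-zero length, but the bytes could not be read as an update.
    Unreadable,
    /// Present and readable (further validation is omitted in this sketch).
    Valid,
}

/// Hypothetical summary of the resulting graph action.
#[derive(Debug, PartialEq)]
enum Action {
    /// Apply the included update to the graph.
    ApplyChannelUpdate,
    /// Non-permanent channel failure: leave the graph alone, let the scorer react.
    TemporaryChannelFailure,
    /// Treat the peer as misbehaving and drop the node.
    RemoveNode,
}

/// Returns `None` for codes without the UPDATE flag, which this sketch does not model.
fn handle_update_failure(failure_code: u16, field: UpdateField) -> Option<Action> {
    if failure_code & UPDATE == 0 {
        return None;
    }
    Some(match field {
        UpdateField::Valid => Action::ApplyChannelUpdate,
        // Present but unreadable is still treated as a total node failure.
        UpdateField::Unreadable => Action::RemoveNode,
        // Before this change a missing update also led to node removal;
        // now it is handled like any other temporary failure.
        UpdateField::Missing => Action::TemporaryChannelFailure,
    })
}
```

Under these assumptions, `handle_update_failure(TEMPORARY_CHANNEL_FAILURE, UpdateField::Missing)` yields `Some(Action::TemporaryChannelFailure)`, which is the CLN case this PR is meant to tolerate.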

@TheBlueMatt TheBlueMatt added this to the 0.0.115 milestone Apr 23, 2023

codecov-commenter commented Apr 23, 2023

Codecov Report

Patch coverage: 84.00% and project coverage change: -0.02 ⚠️

Comparison is base (bc54441) 91.57% compared to head (95ec48a) 91.56%.


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2220      +/-   ##
==========================================
- Coverage   91.57%   91.56%   -0.02%     
==========================================
  Files         104      104              
  Lines       51553    51559       +6     
  Branches    51553    51559       +6     
==========================================
- Hits        47212    47210       -2     
- Misses       4341     4349       +8     
| Impacted Files | Coverage Δ |
|---|---|
| lightning/src/ln/onion_utils.rs | 90.82% <81.81%> (-0.77%) ⬇️ |
| lightning/src/ln/monitor_tests.rs | 97.86% <100.00%> (-0.31%) ⬇️ |
| lightning/src/routing/gossip.rs | 89.93% <100.00%> (+0.03%) ⬆️ |


@TheBlueMatt TheBlueMatt mentioned this pull request Apr 24, 2023
wpaulino previously approved these changes Apr 24, 2023
@TheBlueMatt
Collaborator Author

Went ahead and squashed the fixups:

$ git diff-tree -U1 06ceacffe998b103dff4bf2fd478d778562cb8c2 a22227bf1ad9a46403eb9b751630aa6c74a2fc49
diff --git a/lightning/src/ln/onion_utils.rs b/lightning/src/ln/onion_utils.rs
index ac76a88ff..54b6ecdee 100644
--- a/lightning/src/ln/onion_utils.rs
+++ b/lightning/src/ln/onion_utils.rs
@@ -547,3 +547,3 @@ pub(super) fn process_onion_failure<T: secp256k1::Signing, L: Deref>(secp_ctx: &
 										// If the channel_update had a non-zero length (i.e. was
-										// present) but we coulnd't read it, treat it as a total
+										// present) but we couldn't read it, treat it as a total
 										// node failure.
diff --git a/lightning/src/routing/gossip.rs b/lightning/src/routing/gossip.rs
index 7e0788cb8..cc256b167 100644
--- a/lightning/src/routing/gossip.rs
+++ b/lightning/src/routing/gossip.rs
@@ -214,3 +214,3 @@ pub enum NetworkUpdate {
 	/// An error indicating that a channel failed to route a payment, which should be applied via
-	/// [`NetworkGraph::channel_failed`].
+	/// [`NetworkGraph::channel_failed_permanent`] if permanent.
 	ChannelFailure {
@@ -354,5 +354,6 @@ impl<L: Deref> NetworkGraph<L> where L::Target: Logger {
 			NetworkUpdate::ChannelFailure { short_channel_id, is_permanent } => {
-				let action = if is_permanent { "Removing" } else { "Not touching" };
-				log_debug!(self.logger, "{} channel graph entry for {} due to a payment failure.", action, short_channel_id);
-				self.channel_failed(short_channel_id, is_permanent);
+				if is_permanent {
+					log_debug!(self.logger, "Removing channel graph entry for {} due to a payment failure.", short_channel_id);
+					self.channel_failed_permanent(short_channel_id);
+				}
 			},
@@ -1634,10 +1635,6 @@ impl<L: Deref> NetworkGraph<L> where L::Target: Logger {
 
-	/// Marks a channel in the graph as failed if a corresponding HTLC fail was sent.
-	///
-	/// If permanent, removes a channel from the local storage.
-	/// May cause the removal of nodes too, if this was their last channel.
+	/// Marks a channel in the graph as failed permanently.
 	///
-	/// If not permanent, no action is taken as such a failure likely indicates the node simply
-	/// lacked liquidity and your scorer should handle this instead.
-	pub fn channel_failed(&self, short_channel_id: u64, is_permanent: bool) {
+	/// The channel and any node for which this was their last channel are removed from the graph.
+	pub fn channel_failed_permanent(&self, short_channel_id: u64) {
 		#[cfg(feature = "std")]
@@ -1647,20 +1644,14 @@ impl<L: Deref> NetworkGraph<L> where L::Target: Logger {
 
-		self.channel_failed_with_time(short_channel_id, is_permanent, current_time_unix)
+		self.channel_failed_permanent_with_time(short_channel_id, current_time_unix)
 	}
 
-	/// Marks a channel in the graph as failed if a corresponding HTLC fail was sent.
+	/// Marks a channel in the graph as failed permanently.
 	///
-	/// If permanent, removes a channel from the local storage.
-	/// May cause the removal of nodes too, if this was their last channel.
-	///
-	/// If not permanent, no action is taken as such a failure likely indicates the node simply
-	/// lacked liquidity and your scorer should handle this instead.
-	fn channel_failed_with_time(&self, short_channel_id: u64, is_permanent: bool, current_time_unix: Option<u64>) {
+	/// The channel and any node for which this was their last channel are removed from the graph.
+	fn channel_failed_permanent_with_time(&self, short_channel_id: u64, current_time_unix: Option<u64>) {
 		let mut channels = self.channels.write().unwrap();
-		if is_permanent {
-			if let Some(chan) = channels.remove(&short_channel_id) {
-				let mut nodes = self.nodes.write().unwrap();
-				self.removed_channels.lock().unwrap().insert(short_channel_id, current_time_unix);
-				Self::remove_channel_in_nodes(&mut nodes, &chan, short_channel_id);
-			}
+		if let Some(chan) = channels.remove(&short_channel_id) {
+			let mut nodes = self.nodes.write().unwrap();
+			self.removed_channels.lock().unwrap().insert(short_channel_id, current_time_unix);
+			Self::remove_channel_in_nodes(&mut nodes, &chan, short_channel_id);
 		}
@@ -2600,3 +2591,3 @@ pub(crate) mod tests {
 			// and all of the entries will be tracked as removed.
-			network_graph.channel_failed_with_time(short_channel_id, true, Some(tracking_time));
+			network_graph.channel_failed_permanent_with_time(short_channel_id, Some(tracking_time));
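
For readers skimming the diff, here is a toy model of the renamed method's documented behavior: the channel is removed, the removal is tracked with a timestamp, and any node for which it was the last channel is removed too. `ToyGraph`, its fields, and `add_channel` are hypothetical stand-ins using plain `HashMap`s; the real `NetworkGraph` uses locks and richer channel and node types.

```rust
use std::collections::HashMap;

/// Toy stand-in for a network graph (not LDK's `NetworkGraph`).
#[derive(Default)]
struct ToyGraph {
    /// short_channel_id -> the two endpoint node ids.
    channels: HashMap<u64, (u32, u32)>,
    /// node id -> short_channel_ids of its channels.
    nodes: HashMap<u32, Vec<u64>>,
    /// Channels removed so far, with the optional time of removal.
    removed_channels: HashMap<u64, Option<u64>>,
}

impl ToyGraph {
    /// Hypothetical helper to populate the toy graph.
    fn add_channel(&mut self, scid: u64, a: u32, b: u32) {
        self.channels.insert(scid, (a, b));
        self.nodes.entry(a).or_default().push(scid);
        self.nodes.entry(b).or_default().push(scid);
    }

    /// Removes the channel and any node for which it was the last channel.
    /// Unknown channels are ignored.
    fn channel_failed_permanent(&mut self, scid: u64, current_time_unix: Option<u64>) {
        if let Some((a, b)) = self.channels.remove(&scid) {
            self.removed_channels.insert(scid, current_time_unix);
            for node in [a, b] {
                // Drop the channel from the node's list and note whether the
                // node has any channels left.
                let now_empty = match self.nodes.get_mut(&node) {
                    Some(chans) => {
                        chans.retain(|c| *c != scid);
                        chans.is_empty()
                    }
                    None => false,
                };
                if now_empty {
                    self.nodes.remove(&node);
                }
            }
        }
    }
}
```

Dropping the `is_permanent` flag means there is no longer a no-op path through this method: a temporary failure simply never calls it, which matches the new `if is_permanent { ... }` guard at the `NetworkUpdate::ChannelFailure` call site.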
 

@valentinewallace
Contributor

There's a stray call to the previous method name in the fuzzing code.

@TheBlueMatt
Collaborator Author

Fixed two stray refs:

$ git diff-tree -U1 a22227bf 67ad6c40f
diff --git a/fuzz/src/router.rs b/fuzz/src/router.rs
index 568dcdf02..fe6f1647f 100644
--- a/fuzz/src/router.rs
+++ b/fuzz/src/router.rs
@@ -229,3 +229,3 @@ pub fn do_test<Out: test_logger::Output>(data: &[u8], out: Out) {
 				let short_channel_id = slice_to_be64(get_slice!(8));
-				net_graph.channel_failed(short_channel_id, false);
+				net_graph.channel_failed_permanent(short_channel_id);
 			},
diff --git a/lightning/src/routing/gossip.rs b/lightning/src/routing/gossip.rs
index cc256b167..e5f5e63c9 100644
--- a/lightning/src/routing/gossip.rs
+++ b/lightning/src/routing/gossip.rs
@@ -2624,3 +2624,3 @@ pub(crate) mod tests {
 			// and all of the entries will be tracked as removed.
-			network_graph.channel_failed(short_channel_id, true);
+			network_graph.channel_failed_permanent(short_channel_id);
 

@TheBlueMatt TheBlueMatt merged commit c89fd38 into lightningdevkit:main Apr 24, 2023
14 checks passed