Last sweep tx is constantly rebroadcasted on startup #7708

Closed
roeierez opened this issue May 18, 2023 · 25 comments · Fixed by #7879
Labels: bug (Unintended code behaviour), force closes, P1 (MUST be fixed or reviewed), utxo sweeping
Milestone: v0.17.0

Comments
@roeierez
Contributor

Background

As a follow-up to #7599 (comment):
The last sweep tx is persisted and always rebroadcast on startup. If this tx double spends another tx, it causes the user's balance to stay in an unconfirmed state.

Your environment

  • LND v0.15.3
  • mobile android/iOS using neutrino

Steps to reproduce

We have seen this in support cases on closed channels.

Expected behaviour

We expect the double-spend transaction to be removed rather than constantly rebroadcast.

Actual behaviour

The sweep tx is always rebroadcast.

In general, since btcwallet has a mechanism to rebroadcast unmined transactions, do you see any reason for the sweeper to persist the last sweep tx and republish it at startup? https://github.com/lightningnetwork/lnd/blob/master/sweep/sweeper.go#L344

@roeierez roeierez added bug Unintended code behaviour needs triage labels May 18, 2023
@guggero
Collaborator

guggero commented May 19, 2023

IIUC this was fixed in 0.16.2, see #7448, #7599

@Roasbeef
Member

@guggero this is different, the sweeper will always try to rebroadcast the last sweep transaction each time. It doesn't need to though, as the wallet will pick that up, and any other sweeps are assumed to be re-offered on start up.

@Roasbeef
Member

What we need to do is actually start to remove the last sweep txn from the sweeper's DB, and also remove it here if/when we get that error:

lnd/sweep/sweeper.go

Lines 343 to 365 in bbbf7d3

	// Retrieve last published tx from database.
	lastTx, err := s.cfg.Store.GetLastPublishedTx()
	if err != nil {
		return fmt.Errorf("get last published tx: %v", err)
	}

	// Republish in case the previous call crashed lnd. We don't care about
	// the return value, because inputs will be re-offered and retried
	// anyway. The only reason we republish here is to prevent the corner
	// case where lnd goes into a restart loop because of a crashing publish
	// tx where we keep deriving new output script. By publishing and
	// possibly crashing already now, we haven't derived a new output script
	// yet.
	if lastTx != nil {
		log.Debugf("Publishing last tx %v", lastTx.TxHash())

		// Error can be ignored. Because we are starting up, there are
		// no pending inputs to update based on the publish result.
		err := s.cfg.Wallet.PublishTransaction(lastTx, "")
		if err != nil && err != lnwallet.ErrDoubleSpend {
			log.Errorf("last tx publish: %v", err)
		}
	}

As is, it tries to make minimal assumptions about what will actually be re-broadcasted, but we have a rebroadcast assumption at a higher level now.
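
A minimal sketch of that cleanup, under the assumption of a hypothetical DeleteLastPublishedTx method on the sweeper store (no such method exists today) and with stand-in interfaces rather than lnd's real types:

package sweep

import (
	"errors"
	"fmt"

	"github.com/btcsuite/btcd/wire"
)

// errDoubleSpend stands in for lnwallet.ErrDoubleSpend in this sketch.
var errDoubleSpend = errors.New("transaction rejected: output already spent")

// store is a hypothetical subset of the sweeper store: GetLastPublishedTx
// exists today, DeleteLastPublishedTx is the proposed addition.
type store interface {
	GetLastPublishedTx() (*wire.MsgTx, error)
	DeleteLastPublishedTx() error
}

// wallet is a hypothetical subset of the wallet controller.
type wallet interface {
	PublishTransaction(tx *wire.MsgTx, label string) error
}

// republishLastTx publishes the stored sweep tx once and, if the backend
// reports a conflict, drops it from the store so it is never rebroadcast on
// a later startup.
func republishLastTx(s store, w wallet) error {
	lastTx, err := s.GetLastPublishedTx()
	if err != nil {
		return fmt.Errorf("get last published tx: %w", err)
	}
	if lastTx == nil {
		return nil
	}

	err = w.PublishTransaction(lastTx, "")
	switch {
	case err == nil:
		return nil
	case errors.Is(err, errDoubleSpend):
		// The tx conflicts with something already confirmed or in the
		// mempool: forget it instead of republishing it forever.
		return s.DeleteLastPublishedTx()
	default:
		// Other publish errors stay best-effort, as today.
		return nil
	}
}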

@Roasbeef
Member

We also never delete any of the past sweep txids. This is safe to do as we only need them to detect our own sweeps, but after we see a conflict confirmed, we can remove them.

@roeierez
Contributor Author

Any update on this? We keep getting support cases from our (Breez) users that are unable to spend their funds.

@saubyk
Collaborator

saubyk commented Jul 24, 2023

@Roasbeef does the pr #7746 help address this issue?

@ziggie1984
Collaborator

@Roasbeef does the pr #7746 help address this issue?

No it does not.

I'll try to give a short overview.

What is important to point out is that this is only a problem for neutrino nodes. Normal nodes will remove this double-spend tx immediately, not blocking any funds.
This ties into #7800 as well.

The basic problem is that the bitcoind backend does not send any failure when we broadcast a double spend over BIP157, I think. This is what happens on the bitcoind backend:

2023-07-24T08:24:12Z [net] got inv: witness-tx e111cd8f28b4fa2a0439baca5585f442b5b32674bd15c183404fb9a5889a77b5  new peer=14
2023-07-24T08:24:14Z [net] Requesting tx e111cd8f28b4fa2a0439baca5585f442b5b32674bd15c183404fb9a5889a77b5 peer=14
2023-07-24T08:24:14Z [mempool] stored orphan tx e111cd8f28b4fa2a0439baca5585f442b5b32674bd15c183404fb9a5889a77b5 (mapsz 1 outsz 2)
2023-07-24T08:24:14Z [mempoolrej] e111cd8f28b4fa2a0439baca5585f442b5b32674bd15c183404fb9a5889a77b5 from peer=14 was not accepted: bad-txns-inputs-missingorspent

It does not report an error because it first stores the tx as an orphan and therefore never replies with a reject message to the lnd node. Only later does it refuse the tx, but that result is never sent to the client (the lnd node). Did something change here in the bitcoind backend? I have the feeling that with the current design all transactions are accepted by the bitcoind backend and nothing gets rejected at all?

Solving this problem involves two fixes:

We could just not republish the last transaction as referenced in this issue here:
https://github.com/lightningnetwork/lnd/blob/master/sweep/sweeper.go#L344

This would not completely fix the issue because some of the transactions are already in the "unmined tx store"; see the explanation in #7800.

So we need both parts of the solution. If nobody is already working on this one, I can do that.

I still have two questions:

  1. Is the behaviour of bitcoind correct here with regard to BIP157? Why is bitcoind not rejecting these transactions in the first place? (Will test with btcd later and see whether it reports correctly.)

  2. Before removing the last-tx broadcasting I need to understand this comment by Joost referenced here:

    lnd/sweep/sweeper.go

    Lines 349 to 355 in bbbf7d3

    // Republish in case the previous call crashed lnd. We don't care about
    // the return value, because inputs will be re-offered and retried
    // anyway. The only reason we republish here is to prevent the corner
    // case where lnd goes into a restart loop because of a crashing publish
    // tx where we keep deriving new output script. By publishing and
    // possibly crashing already now, we haven't derived a new output script
    // yet.

    I do not really understand this corner case, so I hesitate to remove this behaviour in the first place.

@ziggie1984
Collaborator

Update regarding BTCD as a backend for the BIP157 connection:

When using btcd instead of bitcoind, the double spend is recognized and the transaction removed, so a short-term fix could be to connect all Breez wallets to btcd backends? @roeierez how do Breez clients decide which BIP157 peer they choose, is it hard coded or based on discovery?

@kingonly

kingonly commented Jul 25, 2023

Update regarding BTCD as a backend for the BIP157 connection:

When using btcd instead of bitcoind, the double spend is recognized and the transaction removed, so a short-term fix could be to connect all Breez wallets to btcd backends? @roeierez how do Breez clients decide which BIP157 peer they choose, is it hard coded or based on discovery?

No discovery. We have a configuration that users can change https://doc.breez.technology/Connecting-a-Bitcoin-node-supporting-BIP-157.html

@Roasbeef
Member

The basic problem is that the bitcoind backend does not send any failure when we broadcast a double spend over BIP157, I think.

Yep...they removed the reject message unfortunately, so light clients get no feedback at all when transactions aren't accepted. btcd still has the reject message, but it's a minority client, so it doesn't help as much unless most of your peers are also sending the message.

@ziggie1984
Collaborator

Yep...they removed the reject message unfortunately, so light clients get no feedback at all when transactions aren't accepted. btcd still has the reject message, but it's a minority client, so it doesn't help as much unless most of your peers are also sending the message.

Our sweeper logic, which is not state aware, relied in a way on this rejection behaviour.
Hmm, this has some potential side effects for our sweeper when using neutrino as a backend: because the sweeper just keeps batching and broadcasting outputs (even ones already in the mempool, since it is not state aware), it would block more wallet outpoints (if enough funds are available) in the worst case until the output is mined. Can we live with this behaviour? My proposed fix would not touch this exact case 🤔

@ziggie1984
Collaborator

ziggie1984 commented Jul 28, 2023

We should address this issue. For now we cannot completely solve this behaviour (without a big design change of the sweeper) because we are not getting the reject msg back from Bitcoin Core (don't know the reason for their decision though). But I think we could mitigate some of the bad side effects this behaviour causes for neutrino nodes.

As we all know now, this problem blocks utxos from the lnd default wallet because invalid transactions are not recognized as invalid by our neutrino backend anymore (no reject msg). This is the main problem, but we can mitigate how long this bad behaviour lasts. We will not be able to completely avoid the false broadcasting, and hence the blocking of utxos, but we can limit it in time.

In the following I will outline 3 cases where this behaviour affects lnd nodes and describe how my proposed change could mitigate them.

  1. The first problem, which was the reason for this issue, is that we publish the last sweep transaction at startup. This can be a transaction which double spends an already resolved contract, leading to funds being blocked. The funds will only be released if our sweeper gets a sweep request for another input (maybe because of another force-close), which would replace the last sweep transaction with a new one and release the blocked wallet utxo of the double-spend sweep transaction, and only after a restart.
    We should definitely take care of this, and don't broadcast the last sweep tx during startup because it might be outdated.

  2. The second case relates to a force-close during which we maybe have 1 outgoing timeout sweep and an anchor sweep as well (2 separate sweeps, because anchor sweeps are not batched). Now lnd will broadcast the 2 sweep transactions, blocking one wallet utxo input for each sweep (in the best case, if it can pay for the fees and the largest strategy is used). Now if the user has to startup the node again, the sweeper will try to rebroadcast those 2 sweeps again BUT it will use 2 new default lnd wallet inputs this time to basically create 2 new sweep transactions. So now 4 wallet utxos are blocked. Each of the swept inputs (anchor and outgoing htlc) now has 2 corresponding sweep transactions, but in reality only 2 transactions made it into the mempool; the other two were rejected by the bitcoin core nodes.
    Currently we do not remove any descendant sweep transaction in case the 2 transactions in the mempool are confirmed. In this situation we will have 2 inputs blocked indefinitely. We can do better here and just remove all descendants when one transaction with the same input is confirmed (see the sketch after this list).
    We would need to remove the !isOurs clause here:
    https://github.com/lightningnetwork/lnd/blob/master/sweep/sweeper.go#L712-L728 as suggested by roasbeef in my other PR. This would basically clean up the complete state in case one of the inputs is confirmed, removing all other sweeps which spent the same input. Referring to my statement above, we do not solve the problem, but we mitigate how long it burdens us, and because sweeps are generally confirmed within the next 3-6 blocks this is a good strategy IMO.

  3. The third case relates to neutrino remove sweeptx #7800. When we have anchor channels (almost always), we will always broadcast 2 anchor CPFP transactions. Because we do not know whether our peer already force-closed the channel, we broadcast the CPFP for our local commitment tx BUT also for the remote commitment tx, in case our PEER already force-closed the channel and the remote commitment tx is in our mempool. Now comes the problem: almost always we will force-close locally and our peer did not broadcast any commitment transaction, only we did. We will, however, broadcast the CPFP for the remote commitment as well (because we cannot be 100% sure). This transaction will not be rejected by the neutrino backend. We have to make sure we remove this CPFP tx as soon as our local commitment tx is confirmed (or the anchor CPFP, which normally happens in the same block). This is proposed in my referenced PR. We have to remove this invalid tx, hence releasing the blocked utxo, as soon as the local anchor CPFP is confirmed, which happens really fast (1-3 blocks).
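
To illustrate the cleanup described in point 2, here is a rough sketch using a hypothetical in-memory index from spent outpoint to the sweep txids we created for it; lnd's real sweeper tracks this differently, so treat it purely as an illustration of "drop every sweep that conflicts with a confirmed spend":

package sweep

import (
	"github.com/btcsuite/btcd/chaincfg/chainhash"
	"github.com/btcsuite/btcd/wire"
)

// sweepIndex is a hypothetical index from spent outpoint to all sweep txids
// we ever created for that outpoint.
type sweepIndex map[wire.OutPoint]map[chainhash.Hash]struct{}

// removeConflicts is called when a transaction spending outpoint confirms.
// Every sweep tx of ours that also spends the outpoint is now invalid
// (whether the confirmed spend is ours or not), so we forget all of them and
// free the wallet inputs they were holding.
func (idx sweepIndex) removeConflicts(spent wire.OutPoint,
	confirmed chainhash.Hash) []chainhash.Hash {

	var stale []chainhash.Hash
	for txid := range idx[spent] {
		if txid != confirmed {
			stale = append(stale, txid)
		}
	}

	// Drop the whole entry: once the outpoint is spent on-chain there is
	// nothing left to sweep for it.
	delete(idx, spent)

	return stale
}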

These scenarios need to be fixed all at once because they can influence each other, so we cannot just fix one case.

That being said, I think it is worth mitigating this behaviour and in the meantime thinking of a better solution which does not bring neutrino wallets into this situation at all?

Happy to hear your thoughts and questions :)

@Roasbeef
Member

Roasbeef commented Aug 2, 2023

because invalid transactions are not recognized as invalid by our neutrino backend anymore (no reject msg)

One question I have is: which class of invalid transactions are we missing? If it's an input already being spent, or one that has been double spent, then we should be detecting these as we register for spend notifications for each input we try to sweep. One known gap is this btcwallet PR, which may help to resolve some of the issues re double-spend anchor outputs.

This can be a transaction which double spends an already resolved contract leading to funds being blocked.

How does this lead to funds being blocked if the contract has already been resolved? Assuming the same set of inputs are used, then contract resolution would then imply that those UTXO no longer exist.

We should definitely take care of this, and don't broadcast the last sweep tx during startup because it might be outdated.

Agree that this is straightforward enough. The design of the sweeper is that all callers will re-offer their inputs, most cases are covered, other than maybe manual BumpFee calls for people trying to do CPFP.

Now if the user has to startup the node again, the sweeper will try to rebroadcast those 2 sweeps again BUT it will use 2 new default lnd wallet inputs this time to basically create 2 new sweep transactions

On startup, a canned transaction is always used; or do you mean that eventually the inputs will be re-offered, leading to another fresh sweep attempt?

We can do better here and just remove all descendants when one transaction with the same input is confirmed.

One thing I'm not following is why exactly people have so many dependent transactions from a sweep. Are the major neutrino-based mobile wallets all spending unconfirmed change constantly?

Also for this series of issues, I think we should try to find and resolve a fundamental issue as far down in the stack as possible. As an example, is there missing btcwallet logic here to just always act as if we may only have partial control? E.g., this removeConflict routine that should handle cases where we use an input that gets confirmed elsewhere.

I wonder if an even simpler temporary mitigation (see comment above about trying to fix this on the most fundamental layer) would just be: don't try to sweep anchor UTXOs in non-fee-bumping scenarios for neutrino nodes. At the default fee rate, the anchors themselves should not actually be positive yield, so they should just sit there. IIRC, we actually skip some anchor itests for neutrino for this very reason: they don't detect spends of the anchor in the mempool, and end up holding onto the UTXOs longer (we give the neutrino node more UTXOs to work with so the test is workable).
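
A sketch of that mitigation as a simple gate; the config struct and field names below are made up for illustration and do not exist in lnd:

package sweep

// anchorSweepConfig is hypothetical; the real decision would live in the
// sweeper / contract court configuration.
type anchorSweepConfig struct {
	// BackendIsNeutrino is true when the chain backend is a BIP157/158
	// light client that cannot see mempool rejections.
	BackendIsNeutrino bool

	// FeeBumpRequested is true when the anchor is needed to CPFP a
	// commitment that must confirm quickly (e.g. pending HTLCs).
	FeeBumpRequested bool
}

// shouldSweepAnchor skips the opportunistic anchor sweep on neutrino nodes:
// at default fee rates the anchor is not positive yield anyway, and the blind
// broadcast can silently conflict and tie up wallet UTXOs.
func shouldSweepAnchor(cfg anchorSweepConfig) bool {
	if cfg.FeeBumpRequested {
		return true
	}
	return !cfg.BackendIsNeutrino
}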

@ziggie1984
Collaborator

Thank you for your insights, I will think about how I can implement a fix on the lower level as you mentioned 👍

Will update my comment here as soon as I have a detailed plan for how to proceed with the implementation.

@ProofOfKeags
Collaborator

Hi Ziggie,

I'm looking into this to provide guidance on the overall strategy as well, but I did find an answer to this question you asked above:

  1. Before removing the last-tx broadcasting I need to understand this comment by Joost referenced here:
    ...
    I do not really understand this corner case, so I hesitate to remove this behaviour in the first place.

Looking into the code, it seems that we usually publish the tx in the sweep call. The issue here is that towards the beginning of that method we generate a new output script (address) to which the funds are ultimately swept. It appears that simply calling this mutates the internal state of the wallet. So in this case, if we get into a crash loop, we will keep generating new output scripts, and if one of the much later ones confirms we may dump money into an address that is significantly past the Address Gap Limit and will be undiscoverable when restoring from seed.

I'm still investigating potential options for your 3-point plan, but I figured I'd relay this in the meantime.

@ziggie1984
Collaborator

Thank you @ProofOfKeags, really appreciate your thoughts on this. I think you are right with your investigation. It seems this corner case needs to be investigated before we can remove it. Before we remove it, we need to be 100% sure the PublishTransaction function does not crash and that every error case is caught, I think? Otherwise we might risk losing funds in a very unlikely case? I need to check in more detail tomorrow whether we can somehow prevent running into this case where we create a lot of addresses, maybe by adding a burst limit or another safety check 🤔

@ProofOfKeags
Collaborator

Alright I think I am starting to grasp the scope of this. First off, thank you for the rigorous analysis @ziggie1984.

The core problem seems to stem from the fact that we do not have a timely, reliable way to detect whether a transaction is rejected by the network with neutrino. However, we do have a reliable way to detect if an output we are currently attempting to sweep is spent by another transaction, since BIP158 states that every output script of every input is a member of the filter. This is ultimately impossible to solve in a timely manner since bitcoind removed the reject message: the only way we can determine with certainty that the transaction was rejected is if a conflicting transaction is mined. Correspondingly, the only way we have to determine that it was accepted is if the transaction we expect is eventually mined. On average this will be 10 minutes away and can be significantly longer than that.

Secondly, as referenced in my previous comment, the reason that we indiscriminately broadcast the last sweep is to avoid creating too much empty space in the wallet tree and overrunning the AGL. As a general rule this means that the main thing we need to ensure is:

sweep idempotence: if we have attempted to sweep an input in the past, any attempt to republish it ought to use the same transaction

If this invariant is guaranteed by the sweeper, we no longer need to use the blunt tool of always republishing in the startup sequence (which, by my analysis is a hack to sidestep the AGL issue).

This change would still need to be compatible with us abandoning the sweep process for that input if any conflicting spend is mined. This is something we should be able to ascertain from the neutrino backend since all inputs are in the filter set. By my estimation, this part is actually already guaranteed by the sweeper for remote parties, and can be improved by implementing the fix you provide in point 2. However, as you point out, simply doing what you say in point 2 is not enough to solve the core problem. It merely allows LND to heal itself eventually. Not the worst, but still not a real solution IMO.
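
One way to express that idempotence invariant in code, as a sketch with a hypothetical store keyed by the sweep's input set (this is not lnd's actual sweeper code):

package sweep

import (
	"sort"
	"strings"

	"github.com/btcsuite/btcd/wire"
)

// txStore is a hypothetical persistence layer keyed by input set.
type txStore interface {
	Get(key string) (*wire.MsgTx, bool)
	Put(key string, tx *wire.MsgTx) error
}

// inputSetKey derives a stable key from the outpoints a sweep spends, so the
// same input set always maps to the same stored transaction.
func inputSetKey(inputs []wire.OutPoint) string {
	keys := make([]string, 0, len(inputs))
	for _, op := range inputs {
		keys = append(keys, op.String())
	}
	sort.Strings(keys)
	return strings.Join(keys, "|")
}

// sweepOnce returns the previously built sweep tx for this input set if one
// exists, otherwise builds a fresh one (deriving a new output script exactly
// once) and persists it before broadcasting.
func sweepOnce(store txStore, inputs []wire.OutPoint,
	build func() (*wire.MsgTx, error)) (*wire.MsgTx, error) {

	key := inputSetKey(inputs)
	if tx, ok := store.Get(key); ok {
		// Idempotent path: republish the exact same transaction, so
		// no new address is derived even across a crash loop.
		return tx, nil
	}

	tx, err := build()
	if err != nil {
		return nil, err
	}
	if err := store.Put(key, tx); err != nil {
		return nil, err
	}
	return tx, nil
}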

Now that leaves point 3, which I think is rather thorny and where I would like to lean on @Roasbeef to correct me if any of the following intuitions overlook important issues. From what I can tell, proactively broadcasting anchor spends in non-fee bumping scenarios seems superfluous. Anchors exist for the sole purpose of ensuring timely confirmation of transactions we would like to see confirmed. As far as I can tell, any remote commitment transaction is preferred to the local commitment for 2 reasons. First, revoked remote commitments give us the chance to hit our peer with the justice transaction which is always the most favorable outcome for us. Second, the remote commitment for the current state is preferable because it grants us access to our funds more quickly. With this in mind, since we should be able to compute the exact structure of any remote commitment transaction for the lifetime of the channel, we should be able to know for sure whether any of them are in the mempool so long as we can get our hands on the mempool. BUT, this is where things get tricky because my understanding is that neutrino offers us no ability to get at the mempool.

The trouble with the approach of always CPFP'ing the anchor of the remote commitment is that it is barely more sensible to do that for the current remote commitment than it is to do it for all remote commitments for the lifetime of the channel. Game-theoretically the old ones should never be broadcast, so preemptively ensuring their confirmation is very unlikely to bear fruit. The current remote commitment is certainly possible, and the only downside to having our own commitment confirm instead of the remote is the lockup period. If we are OK with accepting the suboptimal outcome of our own commitment confirming when a remote commitment has been broadcast, then we should be able to ditch the "blind broadcast" logic of the remote commitment anchor CPFP transaction entirely. The fact that our peer is incentivized the same way in the mirror-image scenario means that even our attempt to CPFP the remote is less likely to bear fruit than it would appear at first.

All of this points to the idea that we should drop the behavior of always CPFP'ing the remote commitment of the current state entirely. This is a really long winded way of agreeing with @Roasbeef's assessment in his last paragraph but with a lot more justification for why. However, I think we can and ought to go further than his suggestion, because making this depend on which backend we use bears a complexity cost to LND that would be annoying to maintain.

So unless there is strong disagreement, and I do think that I need an ACK from @Roasbeef on this, I think the approach is as follows:

  1. Make sweeps of a given input idempotent.
  2. Rebroadcast the way we do in the main loop now, even for the most recent sweep
  3. Ditch the sweep rebroadcast in the startup sequence
  4. Drop the logic for CPFP'ing the remote commitment's local anchor

I believe this solves all 3 problems you outline above.

@ziggie1984
Collaborator

ziggie1984 commented Aug 6, 2023

Thank you for your ideas 🙏, I will present my game plan for fixing this in the following:

Making the sweeper idempotent could fail in situations where we want to bump the fee of an input. An input could also slide into a higher fee group over time because the fee estimation is based on block targets. My current solution would be, as roasbeef suggested, that instead of just broadcasting the latest sweep transaction we also register each input of that transaction with our chain notifier. The only thing we need to add to lnd is a trigger for the removal of conflicting transactions, which would eventually lead us to remove the last sweep transaction. Wdyt? With this approach we could keep the broadcasting and make sure we do not end up with address inflation?
I evaluated the strategy of including it in btcwallet, but I think we should leverage the ability of our chainnotifier, which allows us to start rescan jobs for different SpendRequests.
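
A rough sketch of that registration flow, using a trimmed-down, hypothetical notifier interface instead of lnd's real ChainNotifier; the idea is only to show "watch every input of the last sweep tx and drop the stored tx when a conflicting spend appears":

package sweep

import (
	"github.com/btcsuite/btcd/chaincfg/chainhash"
	"github.com/btcsuite/btcd/wire"
)

// spendEvent is a simplified stand-in for a spend notification: it reports
// which tx spent the watched outpoint.
type spendEvent struct {
	Spend <-chan chainhash.Hash
}

// notifier is a hypothetical, trimmed-down view of a chain notifier.
type notifier interface {
	RegisterSpend(op wire.OutPoint, pkScript []byte,
		heightHint uint32) (*spendEvent, error)
}

// watchLastSweep registers a spend watcher for every input of the last sweep
// tx. If any input is spent by a different transaction, the stored sweep is a
// double spend and onConflict is invoked so it can be removed from the store.
func watchLastSweep(n notifier, lastTx *wire.MsgTx, heightHint uint32,
	onConflict func()) error {

	lastHash := lastTx.TxHash()
	for _, txIn := range lastTx.TxIn {
		ev, err := n.RegisterSpend(
			txIn.PreviousOutPoint, nil, heightHint,
		)
		if err != nil {
			return err
		}

		go func(ev *spendEvent) {
			spender := <-ev.Spend
			if spender != lastHash {
				// Our stored sweep conflicts with a confirmed
				// spend: drop it so it is never rebroadcast.
				onConflict()
			}
		}(ev)
	}
	return nil
}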

Regarding ditching the CPFP logic: I think roasbeef was talking about the sweeping of anchors when the Commitment Tx is already confirmed. There I agree we should make it configurable and switch it off by default.
The problem with ditching the CPFP logic for the remote commitment is this: imagine a scenario where your peer force-closed on you, but for whatever reason does not CPFP the "remote tx". Now you want to fee-bump it because you do not want to wait any longer. What you can currently do is force-close the channel on your side. Now your local commitment tx is rejected because the remote one is in your mempool. With the current logic, you still could feebump the anchor you have unconfirmed which now resulted from trying to CPFP the remote tx.
So removing the logic is a bad idea. My plan is to add an ancestor field for every input which will also be considered when it is spent. Then our chain notifier will ping lnd if this ancestor is spent and we just remove any conflicting transactions?

Will have a prototype by tomorrow so that you can have a look at how I envision it from the code side.

The plan is to split this change into 2 PRs:

  1. First PR fixes the urgent issue of the last_sweep and also the ability to register for ancestors.
  2. Second PR will make the Anchor Sweep after the Commitment Tx is confirmed configurable (with a default of switching it off).

@Roasbeef
Member

Roasbeef commented Aug 8, 2023

make sweeps of a given input idempotent.

I think this would only be possible by adding more state to the sweeper (basically the entire input set that would be swept along with it). The original design of the sweeper was to be more or less stateless, and continually try to replace/bump in the background until something confirmed, then revise the sweeps and do it all over again. If we want to fully bind all the inputs, eg: an input can only ever be spent alongside its cohorts, then we'd need to persistently track all the various sweeps (including RBF bumps, etc).

The other issue with this goal is that we don't always control all the spend paths of the inputs we're trying to sweep. Examples include: 3rd parties spending anchors, breaches, remote party sweeping with preimage when we go to time out the HTLC, etc. If we fully bound the sweep set, then naively the inputs that were batched along with that HTLC output couldn't be published again (most strict interpretation).

With the current logic, you still could feebump the anchor you have unconfirmed which now resulted from trying to CPFP the remote tx.

Agree that for cases where live HTLCs are involved (mainly routing nodes), we need to also be able to bump the fee of the remote party's commitment transaction. This is made more difficult by the fact that we don't actually know if theirs is in the mempool or not (sans polling our mempool to see if it's there), which resulted in the sort of "blind bumping" strategy we have today.

@ProofOfKeags
Collaborator

@ziggie1984

I think roasbeef was talking about the sweeping of anchors when the Commitment Tx is already confirmed.

If this were the case, we wouldn't need to sweep the anchors of the local commitment, because it is impossible for both the local and remote commitment to confirm. So we should either be able to sidestep the issue of 2 mutually conflicting anchor sweeps by waiting until one such commitment confirms, or we have to consider the pre-confirmation scenario, which is the main purpose of anchors to begin with, although we may classify that as CPFP instead of "sweeping".

My current solution would be, as roasbeef suggested, that instead of just broadcasting the latest sweep transaction we also register each input of that transaction with our chain notifier. The only thing we need to add to lnd is a trigger for the removal of conflicting transactions, which would eventually lead us to remove the last sweep transaction.

As far as I understand the ChainNotifier state doesn't persist across restarts of LND. I believe the sweeping logic must be able to tolerate interruption and restarting of the daemon. Alternatively we need to persist enough information to be able to re-register for those notifications.

So removing the logic is a bad idea.

I was unclear in my previous communication. I mean dropping this logic as it pertains to automatically sweeping local anchors for both local and remote commitment txs. We have to keep the CPFP logic for explicit fee bumping scenarios, but should not do so automatically I don't think.

@Roasbeef

I think this would only be possible by adding more state to the sweeper (basically the entire input set that would be swept along with it). The original design of the sweeper was to be more or less stateless, and continually try to replace/bump in the background until something confirmed, then revise the sweeps and do it all over again.

Yes, it would require more state tracking than it currently has. If the current sweeper was designed to be stateless, was that a desire or a requirement? If it is a requirement, what is forcing the statelessness here?

If we fully bound the sweep set, then naively the inputs that were batched along with that HTLC output couldn't be published again (most strict interpretation).

I see, I had failed to consider the batching mechanic. This is true under my proposal. However, I think the part that was missed is that the sweep idempotence should be "reset" if an input used in that sweep is spent by a different (confirmed) transaction. The idea is to sidestep the scenario where we conflict with ourselves and therefore lock up inputs until we detect that conflict. However, while we are awaiting the confirmation of some transaction that will spend any of these outputs, there's no reason we should choose to broadcast a conflicting transaction, except in the case of explicit fee bumping.

@ziggie1984
Collaborator

I was unclear in my previous communication. I mean dropping this logic as it pertains to automatically sweeping local anchors for both local and remote commitment txs. We have to keep the CPFP logic for explicit fee bumping scenarios, but should not do so automatically I don't think.

Hmm, I think we need to keep the auto-feature as well. As roasbeef stated, especially for routing nodes we can end up in a situation where we have an outgoing HTLC which is timing out, but the remote peer already force-closed prior to our HTLC timing out and the commitment is still dangling in the mempool (ours as well, at low fees, because the peer had no urgent need to bump it). Now blocks pass by and eventually our HTLC hits the timeout; now the CPFP needs to happen automatically, without user intervention, because it is time-critical here.

As far as I understand the ChainNotifier state doesn't persist across restarts of LND. I believe the sweeping logic must be able to tolerate interruption and restarting of the daemon. Alternatively we need to persist enough information to be able to re-register for those notifications.

Exactly, I am currently saving all the related input information in an additional subbucket to re-register them after a restart or something similar.
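
A minimal bbolt sketch of such a sub-bucket (bucket names are made up, and lnd's real store goes through its kvdb layer rather than raw bbolt):

package sweep

import (
	"bytes"
	"encoding/binary"

	"github.com/btcsuite/btcd/wire"
	bolt "go.etcd.io/bbolt"
)

var (
	sweeperBucket    = []byte("sweeper")        // top-level bucket (name assumed)
	watchedInputsSub = []byte("watched-inputs") // hypothetical sub-bucket
)

// persistWatchedInput stores an outpoint we need to re-register for spend
// notifications after a restart.
func persistWatchedInput(db *bolt.DB, op wire.OutPoint) error {
	return db.Update(func(tx *bolt.Tx) error {
		top, err := tx.CreateBucketIfNotExists(sweeperBucket)
		if err != nil {
			return err
		}
		sub, err := top.CreateBucketIfNotExists(watchedInputsSub)
		if err != nil {
			return err
		}

		// Key encodes txid || output index; the value is empty.
		var key bytes.Buffer
		key.Write(op.Hash[:])
		if err := binary.Write(&key, binary.BigEndian, op.Index); err != nil {
			return err
		}
		return sub.Put(key.Bytes(), nil)
	})
}

// loadWatchedInputs reads back every persisted outpoint on startup.
func loadWatchedInputs(db *bolt.DB) ([]wire.OutPoint, error) {
	var ops []wire.OutPoint
	err := db.View(func(tx *bolt.Tx) error {
		top := tx.Bucket(sweeperBucket)
		if top == nil {
			return nil
		}
		sub := top.Bucket(watchedInputsSub)
		if sub == nil {
			return nil
		}
		return sub.ForEach(func(k, _ []byte) error {
			var op wire.OutPoint
			copy(op.Hash[:], k[:32])
			op.Index = binary.BigEndian.Uint32(k[32:])
			ops = append(ops, op)
			return nil
		})
	})
	return ops, err
}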

However, while we are awaiting the confirmation of some transaction that will spend any of these outputs, there's no reason we should choose to broadcast a conflicting transaction, except in the case of explicit fee bumping.

Agree, we should not broadcast sweeps that we know beforehand will be rejected by our backend. But we need to balance adding more complexity just for neutrino backends vs. the solution where we broadcast an additional sweep but resolve it when the related output is spent after a couple of minutes/hours?

@yyforyongyu
Collaborator

We should definitely take care of this, and don't broadcast the last sweep tx during startup because it might be outdated.

I think we can instead remove the republishing last tx logic. At first glance it's hacky, and it just hopes that by the time the republish fails, lnd has already reached its crashing point, which can be false.

If it does crash, we should instead investigate the reason rather than restart it. Plus, during restart the sweeper won't create any transactions unless there are pending inputs being sent and the batch time ticker has fired.

Now if the user has to startup the node again, the sweeper will try to rebroadcast those 2 sweeps again BUT it will use 2 new default lnd wallet inputs this time to basically create 2 new sweep transactions.

Note that the whole tx is serialized and saved to disk. During restart, there would only be one transaction being republished.

Overall I think the rebroadcasting logic in PublishTransaction needs to be improved - we shouldn't rebroadcast it unconditionally. In fact, I think republishing the same tx won't have any effects for bitcoind as it won't get relayed by peers anyway, and the rebroadcaster's responsibility should be only keeping the tx in the mempool.

So back to this issue, I think we can limit the scope and look into how we can remove the republishing last tx logic.

@ziggie1984
Collaborator

So back to this issue, I think we can limit the scope and look into how we can remove the republishing last tx logic.

Sounds good. I realized while working on a prototype that the scope can easily become too big when fixing everything at once. So I am happy to first remove the last-sweep-tx logic, because I agree it feels hacky, and even considering registering a notification for each input of the last sweep would increase the hacky factor as well. So I will just remove this logic first. Let me know if somebody has any concerns with this approach ;)

@ziggie1984
Collaborator

ziggie1984 commented Aug 11, 2023

Potential solution from Joost to prevent address inflation if we decide to remove the last sweep tx logic:

what about creating a fixed sweep script on first startup and reusing that forever? maybe not according to the best privacy guidelines, but it seems to be a safe, temporary fix
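
A sketch of that idea with hypothetical store and wallet interfaces: derive the sweep script once, persist it, and return the stored one on every later call, so a crash loop cannot keep deriving new addresses:

package sweep

// scriptStore and addressSource are hypothetical; lnd would back them with
// its sweeper database and wallet.
type scriptStore interface {
	GetSweepScript() ([]byte, error) // returns nil if none stored yet
	PutSweepScript(script []byte) error
}

type addressSource interface {
	NewWitnessScript() ([]byte, error)
}

// fixedSweepScript returns the persisted sweep script, deriving and storing
// one only on the very first call. Reusing the script leaks some privacy but
// keeps the wallet inside its address gap limit even across crash loops.
func fixedSweepScript(store scriptStore, wallet addressSource) ([]byte, error) {
	script, err := store.GetSweepScript()
	if err != nil {
		return nil, err
	}
	if script != nil {
		return script, nil
	}

	script, err = wallet.NewWitnessScript()
	if err != nil {
		return nil, err
	}
	if err := store.PutSweepScript(script); err != nil {
		return nil, err
	}
	return script, nil
}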

@ProofOfKeags
Collaborator

ProofOfKeags commented Aug 11, 2023

I'd rather leak privacy than risk fund loss. Do it.

@saubyk saubyk added this to the v0.17.0 milestone Aug 15, 2023