Channel ERROR: "failing link: unable to resolve fwd pkgs: bucket not found with error: internal error" #6593

Closed
zerofeerouting opened this issue May 30, 2022 · 22 comments · Fixed by #6642
Assignees
Labels
bug Unintended code behaviour database Related to the database/storage of LND htlcswitch P1 MUST be fixed or reviewed

Comments

@zerofeerouting

zerofeerouting commented May 30, 2022

Background

I run a CLN node and have experienced quite a few instances where my node force-closed a channel because the LND peer sent an internal error message.

I finally had this error with a peer that was able to provide the relevant logs (@ZoltanAB)

LND environment

  • LND: 0.14.2-beta
  • OS: Linux ipayblue-1 5.10.0-13-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
  • Using @C-Otto's rebalance-lnd script
    (if that's relevant)

Steps to reproduce

Have a channel between LND / CLN that forwards HTLCs.

Expected behaviour

LND should not send an error.

Actual behaviour

LND sends an error.

Logs

LND Logs (peer A)

2022-05-29 21:46:19.294 [ERR] HSWC: ChannelLink(297f43e0ac9a7307f334dc2a38eac05a86943f77e912dba679bc9cda52284a55:0): unable to remove fwd pkg for height=421027: bucket not found
2022-05-29 21:46:19.294 [ERR] HSWC: ChannelLink(297f43e0ac9a7307f334dc2a38eac05a86943f77e912dba679bc9cda52284a55:0): failing link: unable to resolve fwd pkgs: bucket not found with error: internal error

CLN logs (peer B)

2022-05-29T21:46:12.082Z UNUSUAL 032fe854a231aeb2357523ee6ca263ae04ce53eee8a13767ecbb911b69fefd8ace-channeld-chan#7100: Adding HTLC 2358 too slow: killing connection
2022-05-29T21:46:12.084Z INFO    032fe854a231aeb2357523ee6ca263ae04ce53eee8a13767ecbb911b69fefd8ace-chan#7100: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (9)
2022-05-29T21:46:20.650Z UNUSUAL 032fe854a231aeb2357523ee6ca263ae04ce53eee8a13767ecbb911b69fefd8ace-chan#7100: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel 554a2852da9cbc79a6db12e9773f94865ac0ea382adc34f307739aace0437f29: internal error
2022-05-29T21:46:20.651Z INFO    032fe854a231aeb2357523ee6ca263ae04ce53eee8a13767ecbb911b69fefd8ace-chan#7100: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL
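
For readers correlating the two excerpts: LND logs the channel as a funding outpoint (`txid:index`), while CLN logs the `channel_id` from the received error message. Per BOLT #2, the `channel_id` is the funding txid in wire byte order (the reverse of the displayed hex) with the 16-bit output index XORed into its last two bytes. The following is a minimal sketch of that conversion; the function name is hypothetical and is not part of LND or CLN.

```python
def channel_id_from_outpoint(outpoint: str) -> str:
    """Derive the BOLT #2 channel_id (hex) from a 'txid:index' channel point."""
    txid_hex, index_str = outpoint.split(":")
    index = int(index_str)
    if len(txid_hex) != 64 or not 0 <= index <= 0xFFFF:
        raise ValueError(f"unexpected channel point: {outpoint!r}")
    # The displayed txid is byte-reversed relative to the wire order.
    raw = bytearray(bytes.fromhex(txid_hex)[::-1])
    # XOR the big-endian 16-bit output index into the last two bytes.
    raw[-2] ^= index >> 8
    raw[-1] ^= index & 0xFF
    return raw.hex()


# Values copied from the LND and CLN log lines above.
lnd_channel_point = "297f43e0ac9a7307f334dc2a38eac05a86943f77e912dba679bc9cda52284a55:0"
cln_channel_id = "554a2852da9cbc79a6db12e9773f94865ac0ea382adc34f307739aace0437f29"

# True when both logs refer to the same channel.
same_channel = channel_id_from_outpoint(lnd_channel_point) == cln_channel_id
print(same_channel)
```

With an output index of 0 the XOR changes nothing, so the CLN identifier is simply the byte-reversed funding txid from the LND log.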

Additional info

The LND node was heavily rebalancing and thus running into memory issues about 7 minutes before the event (no log entries up to 2022-05-29 21:40:02.124).

[Image: index]

As you can tell from the graph, they stopped their rebalancing script a couple of hours after the crash.

@ZoltanAB

The rebalancing scripts were working fine. I guess the main issue was trying to run too many instances at the same time; this killed my system (out of memory).

@indomitorum

indomitorum commented May 30, 2022

I run LND 0.14.2, and yesterday I opened a channel with Bcash, who I'm told runs CLN. It was force-closed remotely this morning.

I looked at the logs. This stands out:

ChannelLink(2a4829c56e036b97422be0c61d5a6d926a4731786650a4b00ba55f1cc71a98b7:2): failing link: unable to update commitment: cannot add duplicate

ChannelLink(2a4829c56e036b97422be0c61d5a6d926a4731786650a4b00ba55f1cc71a98b7:2): failing link: unable to complete dance with error: remote unresponsive

[ERR] HSWC: ChannelLink(2a4829c56e036b97422be0c61d5a6d926a4731786650a4b00ba55f1cc71a98b7:2): failing link: unable to synchronize channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error with error: unable to resume channel, recovery required

[ERR] HSWC: ChannelLink(2a4829c56e036b97422be0c61d5a6d926a4731786650a4b00ba55f1cc71a98b7:2): failing link: unable to update commitment: cannot add duplicate keystone with error: internal error

https://pastebin.com/42BqqhbW

@zerofeerouting
Author

@indomitorum Your log looks like it's related to @C-Otto's issue #6485, which should be resolved.

@Crypt-iQ Crypt-iQ added bug Unintended code behaviour crash labels May 30, 2022
@Roasbeef
Member

Is this the same issue as #6485? If so, it'll be resolved in 0.15.

@zerofeerouting
Author

The reason for the error message seems to be something different in this case: failing link: unable to resolve fwd pkgs: bucket not found with error: internal error. If the resulting behaviour is fixed either way, we can close the issue.

@Crypt-iQ
Collaborator

This is not the same issue

@Roasbeef Roasbeef added database Related to the database/storage of LND htlcswitch labels May 30, 2022
@Crypt-iQ Crypt-iQ self-assigned this Jun 6, 2022
@Crypt-iQ
Collaborator

Crypt-iQ commented Jun 6, 2022

@ZoltanAB do you have more logs for this channel for several minutes before and after the above error? When did the node OOM? Relevant log categories would be HSWC, PEER, LNWL, CHDB.

@ZoltanAB

ZoltanAB commented Jun 7, 2022

> @ZoltanAB do you have more logs for this channel for several minutes before and after the above error? When did the node OOM? Relevant log categories would be HSWC, PEER, LNWL, CHDB.

Does the log file contain any sensitive information? If not, I could send you the log file around that date and hour. Please advise. Thank you.

@zerofeerouting
Author

Just a note from me regarding the severity of this issue:

I have had 12 force closes due to this issue in the last seven days. That's a little more than 1% of my channels.

@Crypt-iQ
Collaborator

Crypt-iQ commented Jun 7, 2022

> > @ZoltanAB do you have more logs for this channel for several minutes before and after the above error? When did the node OOM? Relevant log categories would be HSWC, PEER, LNWL, CHDB.
>
> Does the log file contain any sensitive information? If not, I could send you the log file around that date and hour. Please advise. Thank you.

It contains privacy-leaking information (channel points, etc.), which I don't need, so feel free to redact it. I am eugene on the lnd Slack.
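
One way to prepare such an excerpt is sketched below, assuming the log line layout shown in the report (`date time [LEVEL] SUBSYSTEM: message`). It keeps only the requested subsystems and replaces each channel point with a stable placeholder, so lines about the same channel can still be correlated. The function name, placeholder format, and the `RPCS` sample line are illustrative assumptions, not LND tooling, and the sketch does not cover other sensitive data such as node pubkeys or IP addresses.

```python
import re

# Assumed layout: "2022-05-29 21:46:19.294 [ERR] HSWC: message"
LOG_LINE = re.compile(r"^\S+ \S+ \[(?P<level>[A-Z]+)\] (?P<subsystem>[A-Z0-9]+): ")
# A channel point is a 64-character hex txid followed by ":<output index>".
CHANNEL_POINT = re.compile(r"\b[0-9a-f]{64}:\d+\b")
# Subsystems requested in the thread.
WANTED = {"HSWC", "PEER", "LNWL", "CHDB"}


def filter_and_redact(lines, wanted=WANTED):
    """Keep lines from the wanted subsystems and mask channel points."""
    aliases = {}

    def alias(match):
        point = match.group(0)
        # Reuse the same placeholder for repeated mentions of a channel.
        if point not in aliases:
            aliases[point] = f"<chan-{len(aliases) + 1}>"
        return aliases[point]

    kept = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None or m.group("subsystem") not in wanted:
            continue
        kept.append(CHANNEL_POINT.sub(alias, line.rstrip("\n")))
    return kept


# The first two lines come from the report; the RPCS line is a made-up example.
sample = [
    "2022-05-29 21:46:19.294 [ERR] HSWC: ChannelLink(297f43e0ac9a7307f334dc2a38eac05a86943f77e912dba679bc9cda52284a55:0): unable to remove fwd pkg for height=421027: bucket not found",
    "2022-05-29 21:46:19.294 [ERR] HSWC: ChannelLink(297f43e0ac9a7307f334dc2a38eac05a86943f77e912dba679bc9cda52284a55:0): failing link: unable to resolve fwd pkgs: bucket not found with error: internal error",
    "2022-05-29 21:46:19.300 [INF] RPCS: hypothetical line from another subsystem",
]

redacted = filter_and_redact(sample)
for line in redacted:
    print(line)
```

It is still worth reading the result before sharing it, since a pattern-based pass only removes what it was written to recognize.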

@zerofeerouting
Author

Thank you for looking into this @Crypt-iQ.

@Crypt-iQ Crypt-iQ added the P1 MUST be fixed or reviewed label Jun 7, 2022
@ZoltanAB

ZoltanAB commented Jun 7, 2022

@Crypt-iQ can I use your email address (el.....l@gmail.com) to send you the generated log files?

@ZoltanAB

ZoltanAB commented Jun 7, 2022

And FYI, today I had another similar FC around 02:10 AM GMT. Here is a graph of my load on the server for the last 24 hours:
https://gyazo.com/fb06514b54cbdbff5b9784f207667698
I can't see anything unusual, the load seems to be quite constant.

@Crypt-iQ
Collaborator

Crypt-iQ commented Jun 7, 2022

> @Crypt-iQ can I use your email address (el.....l@gmail.com) to send you the generated log files?

Yup.

@Crypt-iQ
Collaborator

Crypt-iQ commented Jun 7, 2022

> And FYI, today I had another similar FC around 02:10 AM GMT. Here is a graph of my load on the server for the last 24 hours: https://gyazo.com/fb06514b54cbdbff5b9784f207667698 I can't see anything unusual, the load seems to be quite constant.

Did you get the same bucket not found log message?

@ZoltanAB

ZoltanAB commented Jun 7, 2022

Yes. Sending you now.

@ZoltanAB

ZoltanAB commented Jun 7, 2022

Just sent you the logs. Thank you.

@Crypt-iQ
Collaborator

Crypt-iQ commented Jun 8, 2022

Thanks for the logs. I know why this happens, and I'll start working on a fix.

@Crypt-iQ Crypt-iQ removed the crash label Jun 8, 2022
@ZoltanAB

ZoltanAB commented Jun 8, 2022 via email

@ZoltanAB

ZoltanAB commented Jun 13, 2022

@Crypt-iQ any update on this? Just had another FC due to this issue. Thank you. If needed, I can send you more logs.

@Crypt-iQ
Collaborator

> @Crypt-iQ any update on this? Just had another FC due to this issue. Thank you. If needed, I can send you more logs.

This won't get into 0.15, since that is right around the corner and I want the fix to receive review without being subject to a deadline. I could provide a patch this week that you could apply to your node if you are comfortable, but you'd need to revert it before upgrading to any other version.

@Crypt-iQ
Collaborator

Crypt-iQ commented Jun 14, 2022

A preliminary fix is in #6642. Hopefully it survives review; it did fix my local repro case. I would recommend not applying this patch to your node until it receives adequate review or 0.15.1 is released.

5 participants