LND restarts with over 1GB channel.db on 32bit linux (RaspberryPi 4) #4811
Comments
Do you have a breakdown of the file sizes in the entire |
Sure:
Actually, I left the backup of the channel.db (uncompacted.db) in there too. |
Having the As to your question about migrating the channel DB from 32bit to 64bit: This is something that I'd like to have a definitive answer to as well. In the past I've always recommended to not migrate between operating systems and/or architectures. For the affected nodes, I'd probably try compiling the |
LND failed again today (after I was able to gain some megabytes with compaction the last time) and now even
I have the 64bit ARM node ready, already restored a (small) database from a 32bit node. |
Went on to migrate the 1GB channel.db to a 64bit ARM system. Looking good so far. |
Oh oh, that sounds like an actual data corruption issue... Do you get the same error if you start that DB with |
I could not start the 1GB+ channel.db on the 32bit system at all (LND restarted immediately), so went on migrating it to the Could the |
Ah, okay. Yeah, it could be, what build/binary version of |
I built chantools from the source at v0.5.1 on the 32bit system specifying the environment: Testing It seems that using a 64bit system does solve this issue, but it may keep coming up, since many projects are based on the default 32bit Raspbian. |
That's good to know that a 32bit binary can result in a segfault for large DBs and that updating to 64bit helps. But that doesn't solve the problem for existing users, I agree. We added the |
Just to confirm to test the garbage collection one should:
Will the garbage collection and/or the compaction feature be on by default (happen on restarts) in lnd v0.12, or will it remain an option to be run? |
Yes, that's how I would try it as well. Though this might not do much if most data isn't from canceled invoices. It really depends on the usage of the node, but for now that is the only automatic garbage collection we've implemented (AFAIK). Just to be clear: If you do this, you won't be able to downgrade the node to There is a new flag to run auto compaction on startup: |
Just had a report from another active node (not under my control) experiencing the same immediate restarts with LND:
and then with chantools (compiled for 32bit as described here: https://github.com/openoms/lightning-node-management/blob/master/LNDdatabaseCompaction.md):
He did not run |
That means the DB is too large to be opened with any 32bit process. If the user doesn't run a 64bit OS, they could copy it to a 64bit ARM device and run compaction there, with |
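To catch this before lnd starts crash-looping, a rough size check could look like the sketch below. It is only an illustration: the 1 GiB threshold comes from the reports in this thread, the function names are hypothetical (nothing here is part of lnd or chantools), and `pointer_bits()` reports the width of the Python interpreter, which is assumed to match the architecture lnd runs on.

```python
import os
import struct

GIB = 1024 ** 3


def pointer_bits() -> int:
    """Width of a native pointer for the running interpreter (32 or 64)."""
    return struct.calcsize("P") * 8


def db_at_risk(size_bytes: int, bits: int, threshold: int = GIB) -> bool:
    """True if a 32-bit process holds a DB at or above the size reported to fail here."""
    return bits == 32 and size_bytes >= threshold


def check_channel_db(path: str) -> str:
    """Return a one-line status for the database file at `path`."""
    size = os.path.getsize(path)
    bits = pointer_bits()
    verdict = "compact or migrate" if db_at_risk(size, bits) else "ok"
    return f"{path}: {size / GIB:.2f} GiB on a {bits}-bit process, {verdict}"
```

Calling `check_channel_db` with the path to a node's channel.db prints nothing by itself; it returns the status string so it can be logged or fed to whatever monitoring the node already has.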
Yes, it sounds clear that there is no solution to this on 32bit ARM and I have just been lucky to be able to run compaction when the issue presented the first time. As he has no other ARM device, I advised him to try the already functional 64bit RaspiBlitz version: raspiblitz/raspiblitz#1199 (comment). PR: raspiblitz/raspiblitz#1833 Thank you for the explanation about the upcoming features of LND: |
Another node reaching 1GB channel.db on 32bit ARM, just documenting the error message. Nothing left but to migrate to 64bit.
|
Update: as there is no other solution proposed here, I have been testing the 64bit Raspberry Pi OS with the RaspiBlitz since opening the issue: The next (v1.7) SDcard release will be based on the 64bit base image and people can already download the RC1 from the dev branch: https://github.com/rootzoll/raspiblitz/tree/dev#downloading-the-software If a 32bit system has this problem, starting with the 64bit image has been tested to solve the issue, and LND will start again. |
Closing this as the issue is with the 32-bit systems. |
Hey @Roasbeef sorry but I have to double check on this: Are you absolutely certain that you want to keep this closed and unfixed which would effectively mean that lnd does not support 32 bit systems in the future (with all the (un)intended consequences that may come with such a decision)? |
Two remarks:
I suppose it should either be fixed, or it should be documented on the main page that lnd explicitly doesn't support 32 bit systems, that it's okay for it to crash as soon as the database reaches 1 GB (which isn't that much for a busy node), and that you're left to pick up the pieces. Yes, the mainstream distros like Umbrel and Raspiblitz migrated to aarch64, but the very popular Raspbian (that I use) is still 32 bit by default, and the 64 bit version is so unofficial/beta that I had to google it; there's no mention of it on the download page. I didn't even know it existed at all! |
@renepickhardt there's nothing to resolve on our end, which is the reason why this issue is closed. This is a matter of the architecture that

The reason why users run into this for 32-bit systems is entirely related to their kernel settings. bbolt uses memory mapping by default to map the entire database onto the virtual memory address space. Ignoring everything else, this means that the database size can grow up to 4 GB (not all physically mapped, but simply addressable). Once the DB size gets over 1 GB, bolt then attempts to resize the memory map, doubling it to 2 GB or so. What happens next depends on the memory split between user space and the kernel. If the kernel is set to (as an example) occupy 3 GB of space, with 1 GB for the user, then this operation will fail (this example assumes 4 GB of RAM; some Pis have less).

In terms of the kernel here, if users only want to run 32-bit systems, then I believe activating the

I wager that only very old or very large nodes running on Pis etc. (that haven't examined their kernel settings at all) need to deal with this. For larger nodes, if a node is that large, then it would behoove the user to switch to more reliable hardware, as Pis can notoriously run into hardware failure issues.

In practice, if a user ever detects this, then it's likely because they have never compacted their database. bolt doesn't actually reclaim the space when things are deleted; instead it keeps it all around on disk, putting the free pages in a free list for new DB operations. Users that run into this can usually just compact their database and the issue goes away. The latest versions of Beyond that, lnd 0.14 will ship with a newly optimized |
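The resize behaviour described above can be modelled in a few lines. This is a simplified sketch, not bbolt's code: the constants (doubling from 32 KiB up to 1 GiB, 1 GiB steps after that, and a cap just under 2 GiB on 32-bit builds) are assumptions based on my reading of bbolt's source, and `fits_in_user_space` ignores everything else the process needs to map.

```python
GIB = 1 << 30
MAX_MMAP_STEP = 1 << 30          # assumed growth step once the DB is above 1 GiB
MAX_MAP_SIZE_32BIT = 0x7FFFFFFF  # assumed cap for 32-bit builds (just under 2 GiB)


def mmap_size(db_size: int, max_map_size: int = MAX_MAP_SIZE_32BIT) -> int:
    """Size of the memory map requested for a database of db_size bytes."""
    # Up to 1 GiB: round up to the next power of two, starting at 32 KiB.
    for shift in range(15, 31):
        if db_size <= 1 << shift:
            return 1 << shift
    if db_size > max_map_size:
        raise ValueError("mmap too large")
    # Above 1 GiB: round up to the next multiple of the step, capped at the maximum.
    remainder = db_size % MAX_MMAP_STEP
    size = db_size if remainder == 0 else db_size + MAX_MMAP_STEP - remainder
    return min(size, max_map_size)


def fits_in_user_space(db_size: int, user_space: int) -> bool:
    """True if the requested map alone fits into the given user address space."""
    return mmap_size(db_size) <= user_space
```

Under these assumptions a 900 MiB database still maps into 1 GiB, while a database one byte over 1 GiB asks for roughly 2 GiB, which no longer fits if the user side of the split is only 1 GiB.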
To provide a bit more detail as to why the operation can fail (let's assume 4 GB of RAM, DB is 2 GB at this point), see this method: https://github.com/etcd-io/bbolt/blob/master/node.go#L517 What happens is that bolt needs to copy the entire database into heap memory temporarily to ensure that once it unmaps, then maps the database, the inodes, etc. aren't pointing to stale memory. I think this is the error most people are running into: when the DB needs to double in size initially (and at any 1 GB increment beyond that), there simply isn't enough addressable physical memory. |
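Taking that description at face value, the memory needed during a remap is roughly the temporary heap copy plus the new mapping. The sketch below is back-of-the-envelope arithmetic only; the function names are hypothetical and the 3 GiB user address space in the usage note is an assumed example, not a measured value.

```python
GIB = 1 << 30


def remap_peak_bytes(db_size: int, new_map_size: int) -> int:
    """Approximate peak need while remapping: heap copy of the DB plus the new map."""
    return db_size + new_map_size


def remap_fits(db_size: int, new_map_size: int, user_space: int) -> bool:
    """True if the approximate peak fits into the given user address space."""
    return remap_peak_bytes(db_size, new_map_size) <= user_space
```

For a database just over 1 GiB growing into a 2 GiB map, `remap_fits(GIB + 1, 2 * GIB, 3 * GIB)` comes out false, which is consistent with nodes failing right around the 1 GB mark rather than at 2 GB.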
One attempt to mitigate this somewhat in the past was trying to set an initial memory map of the largest DB a 32-bit system can handle: btcsuite/btcwallet#697 This would mean that the DB never needs to be copied over, as it would never need to be remapped. IIRC we fell short on testing on that, and also just generally concluded that in 2021 the effort to make things slightly safer for 32-bit systems may not have been worth it. Since then, IIRC, we entertained removing the compiled 32-bit binaries, but then decided that for smaller nodes (as they should be on a Pi) regular compaction resolves the issue in practice. |
Background
LND fails when the channel.db reaches 1GB in size
Described previously here:
raspiblitz/raspiblitz#1778
Now I had this repeatedly happen on two nodes and one has no leeway for further compaction.
Your environment
lnd: v0.11.1
Steps to reproduce
Run lnd and once the channel.db grows over 1GB it restarts repeatedly and reproducibly.
This last node seems to always fail at these log entries:
@Roasbeef has suggested that this might be the case above 2GB on a 32bit system, but it seems it came earlier.
Working to migrate the project to 64bit OS: raspiblitz/raspiblitz#1199
And I will certainly do it for these nodes asap.
Do you think that switching the database between the 32 and 64 bit ARM architecture poses further risks?
This can be a significant problem, since the 32bit OS is the default recommendation for Raspberry Pis and many projects are based on it, including RaspiBlitz, RaspiBolt, myNode, Umbrel etc. (Nodl has used different SBCs and the aarch64 architecture from the beginning.)