
LND restarts with over 1GB channel.db on 32bit linux (RaspberryPi 4) #4811

Closed
openoms opened this issue Nov 30, 2020 · 23 comments

Comments

@openoms
Contributor

openoms commented Nov 30, 2020

Background

LND fails when the channel.db reaches a size of 1GB.
Described previously here:
raspiblitz/raspiblitz#1778

This has now happened repeatedly on two nodes, and one of them has no leeway for further compaction.

Your environment

Steps to reproduce

Run lnd and once the channel.db grows over 1GB it restarts repeatedly and reproducibly.
This last node seems to always fail at these log entries:

2020-11-30 17:36:57.455 [TRC] CHDB: Pruning nodes from graph with no open channels
2020-11-30 17:36:57.502 [INF] CHDB: Pruned unconnected node 038b731b984c3e9ef0d1710ffcd2ecdbddbf1d6097e900107d94ead7b7b1fd4956 from channel graph                                                                 
2020-11-30 17:36:57.503 [INF] CHDB: Pruned unconnected node 03ba8210cea60dc7a98a0b6c87a71622d85a664ec203663dcebca09177de7a2563 from channel graph                                                                 
2020-11-30 17:36:57.503 [INF] CHDB: Pruned unconnected node 021b00feca3ea07602cc12a44423735540d8212041864e9cc3c856a6171be88cc7 from channel graph                                                                 
2020-11-30 17:36:57.504 [INF] CHDB: Pruned unconnected node 02151b861f4825eaed7c0e582c46e6af624ae61819eec7ae31f348e3dfa9615d6b from channel graph                                                                 
2020-11-30 17:36:57.504 [INF] CHDB: Pruned unconnected node 021cd9c8d3cb0f934a6e446fc2a7db8e44c5a31b15d181d7d2f07128faf387dd27 from channel graph                                                                 
2020-11-30 17:36:57.504 [INF] CHDB: Pruned 5 unconnected nodes from the channel graph

@Roasbeef has suggested that this might only be the case after 2GB on a 32bit system, but it seems to have come earlier.

Working to migrate the project to a 64bit OS: raspiblitz/raspiblitz#1199
I will certainly do it for these nodes asap.
Do you think that switching the database between the 32 and 64 bit ARM architectures poses further risks?

This can be a significant problem since the 32bit OS is the default recommendation for Raspberry Pis and many projects are based on it, including RaspiBlitz, RaspiBolt, myNode, Umbrel etc. (Nodl has used different SBCs and the aarch64 architecture from the beginning.)

@Roasbeef
Member

Do you have a breakdown of the file sizes in the entire .lnd directory?

@openoms
Contributor Author

openoms commented Nov 30, 2020

Do you have a breakdown of the file sizes in the entire .lnd directory?

Sure:

$ sudo du -ah /mnt/hdd/lnd/
16K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/channel.backup
4.0K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/walletkit.macaroon
4.0K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/chainnotifier.macaroon
4.0K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/admin.macaroon
4.0K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/invoices.macaroon
20K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/macaroons.db
2.0M	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/wallet.db
4.0K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/signer.macaroon
4.0K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/invoice.macaroon
4.0K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/router.macaroon
4.0K	/mnt/hdd/lnd/data/chain/bitcoin/mainnet/readonly.macaroon
2.1M	/mnt/hdd/lnd/data/chain/bitcoin/mainnet
2.1M	/mnt/hdd/lnd/data/chain/bitcoin
2.1M	/mnt/hdd/lnd/data/chain
9.3M	/mnt/hdd/lnd/data/watchtower/bitcoin/mainnet/watchtower.db
9.3M	/mnt/hdd/lnd/data/watchtower/bitcoin/mainnet
9.3M	/mnt/hdd/lnd/data/watchtower/bitcoin
4.0K	/mnt/hdd/lnd/data/watchtower/v3_onion_private_key
9.3M	/mnt/hdd/lnd/data/watchtower
21M	/mnt/hdd/lnd/data/graph/mainnet/sphinxreplay.db
1008M	/mnt/hdd/lnd/data/graph/mainnet/channel.db
33M	/mnt/hdd/lnd/data/graph/mainnet/wtclient.db
1.0G	/mnt/hdd/lnd/data/graph/mainnet/uncompacted.db
2.1G	/mnt/hdd/lnd/data/graph/mainnet
2.1G	/mnt/hdd/lnd/data/graph
2.1G	/mnt/hdd/lnd/data
4.0K	/mnt/hdd/lnd/v3_onion_private_key
4.0K	/mnt/hdd/lnd/lnd.conf
3.4M	/mnt/hdd/lnd/logs/bitcoin/mainnet/lnd.log
2.0M	/mnt/hdd/lnd/logs/bitcoin/mainnet/lnd.log.21554.gz
2.8M	/mnt/hdd/lnd/logs/bitcoin/mainnet/lnd.log.21553.gz
2.0M	/mnt/hdd/lnd/logs/bitcoin/mainnet/lnd.log.21552.gz
11M	/mnt/hdd/lnd/logs/bitcoin/mainnet
11M	/mnt/hdd/lnd/logs/bitcoin
11M	/mnt/hdd/lnd/logs
4.0K	/mnt/hdd/lnd/tls.key
4.0K	/mnt/hdd/lnd/lnd.conf.save
4.0K	/mnt/hdd/lnd/tls.cert
2.1G	/mnt/hdd/lnd/

Actually I left the backup of the channel.db (uncompacted.db) in there too.
Now moved it out (just 5 mins ago - so no restarts since). Could that be part of the problem?

@guggero
Collaborator

guggero commented Nov 30, 2020

Having the uncompacted.db file in there should have no effect at all. Probably just coincidence that it's running longer now.

As to your question about migrating the channel DB from 32bit to 64bit: this is something that I'd like to have a definitive answer to as well. In the past I've always recommended not migrating between operating systems and/or architectures.
But digging a bit into bbolt, it seems there is at least the intention of being cross-OS/cross-arch compatible. At least according to this issue there only seem to be differences between Windows and the rest, and only for large files. Maybe you could try running the sample code provided there, compiled both as a 32bit and a 64bit binary?

For the affected nodes, I'd probably try compiling the bbolt binary as 64bit and running compaction on the 32bit channel.db (and the other DBs too). If that doesn't result in errors, and neither does bbolt check, I'd then dare to run everything with a 64bit lnd binary.
But perhaps try this with a larger testnet node first? And once you've started the "migrated" channel.db you shouldn't just go back and run the old one, to make sure a stray fee update doesn't risk you publishing an old state when going back to the old DB.
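A rough sketch of that workflow, assuming the standalone bbolt CLI from go.etcd.io/bbolt (the exact module path and flags may differ with the version you install), run on a 64bit machine and always against a copy of the database:

$ go install go.etcd.io/bbolt/cmd/bbolt@latest
$ bbolt compact -o /tmp/channel-compacted.db /path/to/channel.db
$ bbolt check /tmp/channel-compacted.db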

@openoms
Contributor Author

openoms commented Dec 1, 2020

LND failed again today (after I was able to gain some megabytes with compaction the last time) and now even chantools fails:

$ chantools compactdb --sourcedb /mnt/hdd/lnd/data/graph/mainnet/channel.db                 --destdb /mnt/hdd/lnd/data/graph/mainnet/compacted.db
2020-12-01 13:17:03.718 [INF] CHAN: chantools version v0.5.1 commit v0.5.1
unexpected fault address 0x22aa0044
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x22aa0044 pc=0x2eb55c]

goroutine 1 [running]:
runtime.throw(0x67fbcc, 0x5)
	/usr/local/go/src/runtime/panic.go:774 +0x5c fp=0x1c3da28 sp=0x1c3da14 pc=0x3ecec
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:401 +0x310 fp=0x1c3da40 sp=0x1c3da28 pc=0x54010
github.com/coreos/bbolt.(*DB).meta(0x1d7c140, 0x376)
	/home/bitcoin/go/pkg/mod/github.com/coreos/bbolt@v1.3.3/db.go:901 +0x1c fp=0x1c3da5c sp=0x1c3da44 pc=0x2eb55c
github.com/coreos/bbolt.(*DB).hasSyncedFreelist(...)
	/home/bitcoin/go/pkg/mod/github.com/coreos/bbolt@v1.3.3/db.go:323
github.com/coreos/bbolt.(*Tx).rollback(0x1eb2580)
	/home/bitcoin/go/pkg/mod/github.com/coreos/bbolt@v1.3.3/tx.go:279 +0x68 fp=0x1c3da74 sp=0x1c3da5c pc=0x2f493c
github.com/coreos/bbolt.(*Tx).Commit(0x1eb2580, 0x2ac7b21a, 0x8)
	/home/bitcoin/go/pkg/mod/github.com/coreos/bbolt@v1.3.3/tx.go:161 +0x430 fp=0x1c3db14 sp=0x1c3da74 pc=0x2f4374
main.(*compactDBCommand).compact.func2(0x20496a0, 0x1, 0x1, 0x2ac7b374, 0x8, 0x8, 0x2ac7b37c, 0x152, 0x152, 0x0, ...)
	/home/bitcoin/chantools/cmd/chantools/compactdb.go:77 +0x318 fp=0x1c3db58 sp=0x1c3db14 pc=0x559428
main.(*compactDBCommand).walkBucket(0x1c0bd80, 0x206fda0, 0x20496a0, 0x1, 0x1, 0x2ac7b374, 0x8, 0x8, 0x2ac7b37c, 0x152, ...)
	/home/bitcoin/chantools/cmd/chantools/compactdb.go:161 +0x74 fp=0x1c3dbac sp=0x1c3db58 pc=0x54a8b0
main.(*compactDBCommand).walkBucket.func1(0x2ac7b374, 0x8, 0x8, 0x2ac7b37c, 0x152, 0x152, 0x152, 0x0)
	/home/bitcoin/chantools/cmd/chantools/compactdb.go:186 +0x264 fp=0x1c3dc04 sp=0x1c3dbac pc=0x559980
github.com/coreos/bbolt.(*Bucket).ForEach(0x206fda0, 0x1c3dc70, 0x0, 0x0)
	/home/bitcoin/go/pkg/mod/github.com/coreos/bbolt@v1.3.3/bucket.go:388 +0xf8 fp=0x1c3dc38 sp=0x1c3dc04 pc=0x2e5868
main.(*compactDBCommand).walkBucket(0x1c0bd80, 0x206fda0, 0x0, 0x0, 0x0, 0x66c98485, 0x1b, 0x1b, 0x0, 0x0, ...)
	/home/bitcoin/chantools/cmd/chantools/compactdb.go:172 +0x124 fp=0x1c3dc8c sp=0x1c3dc38 pc=0x54a960
main.(*compactDBCommand).walk.func1.1(0x66c98485, 0x1b, 0x1b, 0x206fda0, 0x206fda0, 0x0)
	/home/bitcoin/chantools/cmd/chantools/compactdb.go:151 +0x13c fp=0x1c3dcd4 sp=0x1c3dc8c pc=0x559664
github.com/coreos/bbolt.(*Tx).ForEach.func1(0x66c98485, 0x1b, 0x1b, 0x0, 0x0, 0x0, 0x0, 0x0)
	/home/bitcoin/go/pkg/mod/github.com/coreos/bbolt@v1.3.3/tx.go:129 +0x70 fp=0x1c3dcf4 sp=0x1c3dcd4 pc=0x2f77d4
github.com/coreos/bbolt.(*Bucket).ForEach(0x1c8c18c, 0x1c3dd3c, 0x1d6e960, 0x0)
	/home/bitcoin/go/pkg/mod/github.com/coreos/bbolt@v1.3.3/bucket.go:388 +0xf8 fp=0x1c3dd28 sp=0x1c3dcf4 pc=0x2e5868
github.com/coreos/bbolt.(*Tx).ForEach(0x1c8c180, 0x1c63d5c, 0x2eabb4, 0x1d7c000)
	/home/bitcoin/go/pkg/mod/github.com/coreos/bbolt@v1.3.3/tx.go:128 +0x58 fp=0x1c3dd48 sp=0x1c3dd28 pc=0x2f3e38
main.(*compactDBCommand).walk.func1(0x1c8c180, 0x1d7c200, 0x1c8c180)
	/home/bitcoin/chantools/cmd/chantools/compactdb.go:145 +0x54 fp=0x1c3dd68 sp=0x1c3dd48 pc=0x5596f4
github.com/coreos/bbolt.(*DB).View(0x1d7c000, 0x1c63dc0, 0x0, 0x0)
	/home/bitcoin/go/pkg/mod/github.com/coreos/bbolt@v1.3.3/db.go:725 +0x90 fp=0x1c3ddac sp=0x1c3dd68 pc=0x2eac10
main.(*compactDBCommand).walk(0x1c0bd80, 0x1d7c000, 0x1c63e18, 0x0, 0x0)
	/home/bitcoin/chantools/cmd/chantools/compactdb.go:144 +0x54 fp=0x1c3ddcc sp=0x1c3ddac pc=0x54a814
main.(*compactDBCommand).compact(0x1c0bd80, 0x1d7c140, 0x1d7c000, 0x0, 0x0)
	/home/bitcoin/chantools/cmd/chantools/compactdb.go:72 +0xe4 fp=0x1c3de30 sp=0x1c3ddcc pc=0x54a73c
main.(*compactDBCommand).Execute(0x1c0bd80, 0x1c6c0c0, 0x0, 0x5, 0x1c0bd80, 0x1)
	/home/bitcoin/chantools/cmd/chantools/compactdb.go:39 +0x1bc fp=0x1c3de5c sp=0x1c3de30 pc=0x54a488
github.com/jessevdk/go-flags.(*Parser).ParseArgs(0x1c163c0, 0x1c16038, 0x5, 0x5, 0xae, 0x0, 0x0, 0x5ec4a0, 0x1c88220)
	/home/bitcoin/go/pkg/mod/github.com/jessevdk/go-flags@v1.4.0/parser.go:316 +0x664 fp=0x1c3df34 sp=0x1c3de5c pc=0x53bee8
github.com/jessevdk/go-flags.(*Parser).Parse(...)
	/home/bitcoin/go/pkg/mod/github.com/jessevdk/go-flags@v1.4.0/parser.go:186
main.runCommandParser(0x1, 0x14e0c)
	/home/bitcoin/chantools/cmd/chantools/main.go:160 +0x798 fp=0x1c3df70 sp=0x1c3df34 pc=0x54f240
main.main()
	/home/bitcoin/chantools/cmd/chantools/main.go:53 +0x14 fp=0x1c3dfa4 sp=0x1c3df70 pc=0x54e9f4
runtime.main()
	/usr/local/go/src/runtime/proc.go:203 +0x208 fp=0x1c3dfe4 sp=0x1c3dfa4 pc=0x40c6c
runtime.goexit()
	/usr/local/go/src/runtime/asm_arm.s:868 +0x4 fp=0x1c3dfe4 sp=0x1c3dfe4 pc=0x6a734

goroutine 7 [select]:
io.(*pipe).Read(0x1c16390, 0x1d62000, 0x1000, 0x1000, 0xc50039, 0xea24c, 0x1)
	/usr/local/go/src/io/pipe.go:50 +0xac
io.(*PipeReader).Read(0x1c7caf0, 0x1d62000, 0x1000, 0x1000, 0x4b, 0x4, 0x0)
	/usr/local/go/src/io/pipe.go:127 +0x38
bufio.(*Reader).fill(0x2059f84)
	/usr/local/go/src/bufio/bufio.go:100 +0x108
bufio.(*Reader).ReadSlice(0x2059f84, 0xa, 0x1, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:359 +0x2c
bufio.(*Reader).ReadLine(0x2059f84, 0xc50039, 0x1, 0x1, 0x1, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:388 +0x24
github.com/jrick/logrotate/rotator.(*Rotator).Run(0x1c16360, 0x89bea0, 0x1c7caf0, 0x0, 0x0)
	/home/bitcoin/go/pkg/mod/github.com/jrick/logrotate@v1.0.0/rotator/rotator.go:100 +0x90
github.com/lightningnetwork/lnd/build.(*RotatingLogWriter).InitLogRotator.func1(0x2080ee0, 0x1c7caf0)
	/home/bitcoin/go/pkg/mod/github.com/guggero/lnd@v0.9.0-beta-rc4.0.20200826102054-8c9171307182/build/logrotator.go:80 +0x30
created by github.com/lightningnetwork/lnd/build.(*RotatingLogWriter).InitLogRotator
	/home/bitcoin/go/pkg/mod/github.com/guggero/lnd@v0.9.0-beta-rc4.0.20200826102054-8c9171307182/build/logrotator.go:79 +0x288

I have the 64bit ARM node ready, already restored a (small) database from a 32bit node.
Will try to run the compaction there first.

@openoms
Contributor Author

openoms commented Dec 1, 2020

Went on to migrate the 1GB channel.db to a 64bit ARM system. Looking good so far.
All channels and peers are online and can't see any errors in the LND.log.
Will give it some time now.

@guggero
Collaborator

guggero commented Dec 1, 2020

[signal SIGSEGV: segmentation violation code=0x1 addr=0x22aa0044 pc=0x2eb55c]

Oh oh, that sounds like an actual data corruption issue... Do you get the same error if you start that DB with lnd again?

@openoms
Contributor Author

openoms commented Dec 1, 2020

Oh oh, that sounds like an actual data corruption issue... Do you get the same error if you start that DB with lnd again?

I could not start the 1GB+ channel.db on the 32bit system at all (LND restarted immediately), so I went on to migrate it to the aarch64 system and it has been running there since without errors.

Could the chantools error be the same problem with bbolt and the size of the database?

@guggero
Collaborator

guggero commented Dec 1, 2020

Ah, okay. Yeah, it could be. What build/binary version of chantools did you run the command with? If there's no error when using a 64bit ARM chantools, that's probably the problem.

@openoms
Contributor Author

openoms commented Dec 1, 2020

I built chantools from the source at v0.5.1 on the 32bit system specifying the environment:
CGO_ENABLED=0 GOOS=linux GOARCH=arm GOARM=7 make install
This has worked well before, but repeated compactions could shrink the database less and less. On the last run I could only gain a few megabytes and it grew back within a couple of days.

Testing chantools compactdb from the latest source on aarch64 resulted in no errors and the database went from 1062508 to 1038376 bytes. LND started again without issues.

It seems that using a 64bit system does solve this issue, but it can be a recurring question since many projects are based on the default 32bit Raspbian.
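For reference, building chantools for 64bit ARM should presumably only need the architecture flags swapped (GOARM applies only to 32bit ARM, so it is dropped); something like:

CGO_ENABLED=0 GOOS=linux GOARCH=arm64 make install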

@guggero
Collaborator

guggero commented Dec 1, 2020

Good to know that a 32bit binary can result in a segfault for large DBs and that updating to 64bit helps.

But that doesn't solve the problem for existing users, I agree. We added the gc-canceled-invoices-on-startup flag that does some garbage collection. It's only on master now but will be in the next version. If you run with that set to true, then shutdown and compact again, do you see a big reduction in the size?

@openoms
Contributor Author

openoms commented Dec 2, 2020

Just to confirm, to test the garbage collection one should:

  • stop and run chantools compactdb
  • check channel.db size
  • build LND from source at the latest master (before lnd 0.12)
  • start the daemon with lnd --gc-canceled-invoices-on-startup (how long to wait for or look for some message in the lnd.log?)
  • check channel.db size
  • stop and run chantools compactdb
  • check the compacted.db size

Will the garbage collection and/or the compaction feature be on by default (happen on restarts) in lnd v0.12, or will it remain an option to be run manually?

@guggero
Collaborator

guggero commented Dec 2, 2020

Yes, that's how I would try it as well. Though this might not do much if most data isn't from canceled invoices. It really depends on the usage of the node, but for now that is the only automatic garbage collection we've implemented (AFAIK).
You should see a message in the logs mentioning the garbage collection. I don't recall the exact message.

Just to be clear: If you do this, you won't be able to downgrade the node to lnd v0.11.1-beta because of the database migrations it contains.

There is a new flag to run auto compaction on startup: --db.bolt.auto-compact
You might also want to set --db.bolt.auto-compact-min-age=0 to enable compaction on every restart and not just every week (which is the default). If you use that, you can skip the chantools compactdb calls and instead just restart lnd to achieve the same effect.
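For example, with the flags mentioned above, the combination could be started like this (whether you keep these on the command line or put the equivalent lines into lnd.conf is up to you):

$ lnd --gc-canceled-invoices-on-startup --db.bolt.auto-compact --db.bolt.auto-compact-min-age=0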

@openoms
Contributor Author

openoms commented Dec 2, 2020

Just had a report from another active node (not under my control) experiencing the same immediate restarts with LND:

admin@raspberrypi:~ $ sudo du -h /mnt/hdd/lnd/data/graph/mainnet/channel.db
1.0G    /mnt/hdd/lnd/data/graph/mainnet/channel.db

and then with chantools (compiled for 32bit as described here: https://github.com/openoms/lightning-node-management/blob/master/LNDdatabaseCompaction.md):

bitcoin@raspberrypi:~/chantools$ chantools compactdb --sourcedb /mnt/hdd/lnd/data/graph/mainnet/channel.db \
>                 --destdb /mnt/hdd/lnd/data/graph/mainnet/compacted.db
2020-12-02 18:05:45.502 [INF] CHAN: chantools version v0.6.0 commit v0.6.0-1-gf82d78d
unexpected fault address 0x22a95044
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x22a95044 pc=0x2eb55c]

He did not run chantools compactdb before.

@guggero
Collaborator

guggero commented Dec 2, 2020

That means the DB is too large to be opened by any 32bit process. If the user doesn't run a 64bit OS, they could copy it to a 64bit ARM device and run compaction there, with chantools compiled for 64bit. If compaction was never run, the benefit might be large enough to postpone the problem again (until RaspiBlitz updates to 64bit in general).

@openoms
Contributor Author

openoms commented Dec 2, 2020

Yes, it sounds clear that there is no solution to this on 32bit ARM and I have just been lucky to be able to run compaction when the issue presented itself the first time.

As he has no other ARM device, I advised him to try the already functional 64bit RaspiBlitz version: raspiblitz/raspiblitz#1199 (comment). PR: raspiblitz/raspiblitz#1833

Thank you for the explanation about the upcoming features of LND:
--gc-canceled-invoices-on-startup and --db.bolt.auto-compact will help to delay this issue for most users and hopefully give everyone enough time to update to 64bit pre-emptively.

@openoms
Contributor Author

openoms commented Dec 9, 2020

Another node is reaching a 1GB channel.db on 32bit ARM; just documenting the error message. Nothing is left to do but migrate to 64bit.

[INF] LTND: Version: 0.11.1-beta commit=v0.11.1-beta, build=production, logging=default
[INF] LTND: Active chain: Bitcoin (network=mainnet)
[INF] LTND: Opening the main database, this might take a few minutes...
[INF] LTND: Opening bbolt database, sync_freelist=true
[INF] CHDB: Checking for schema update: latest_version=17, db_version=17
[INF] LTND: Database now open (time_to_open=11.810386ms)!
[INF] RPCS: password gRPC proxy started at 0.0.0.0:8080
[INF] RPCS: password RPC server listening on 0.0.0.0:10009
[INF] LTND: Waiting for wallet encryption password. Use `lncli create` to create a wallet, `lncli unlock` to unlock an existing wallet, or `lncli changepassword` to change the password of an existing wallet and unlock it.
[ERR] LNWL: Failed to open database: cannot allocate memory

$ sudo du  /mnt/hdd/lnd/data/graph/mainnet/channel.db
1048544	/mnt/hdd/lnd/data/graph/mainnet/channel.db

$ lncli unlock
Input wallet password: 
[lncli] rpc error: code = Unknown desc = cannot allocate memory

@openoms
Contributor Author

openoms commented Apr 5, 2021

Update: as there is no other solution proposed here, I have been testing the 64bit Raspberry Pi OS with the RaspiBlitz since opening the issue:
raspiblitz/raspiblitz#1199

The next (v1.7) SDcard release will be based on the 64bit base image and people can already download the RC1 from the dev branch: https://github.com/rootzoll/raspiblitz/tree/dev#downloading-the-software

If a 32bit system has this problem, starting with the 64bit image is tested to solve the issue and LND will start again.

@Roasbeef
Member

Closing this as the issue is with the 32-bit systems.

@renepickhardt

Hey @Roasbeef, sorry but I have to double check on this: are you absolutely certain that you want to keep this closed and unfixed, which would effectively mean that lnd does not support 32 bit systems in the future (with all the (un)intended consequences that may come with such a decision)?

@rkfg

rkfg commented Aug 30, 2021

Two remarks:

  • for me, moving the lnd data posed no issues: I migrated everything (bitcoind, electrumx and lnd data) from an amd64 machine to armv7 (32 bit) and it all just worked; now I need to update the base system on my RPi to aarch64 before channel.db grows big enough... but there's still plenty of time. So I suppose there will be no issues if anyone decides to migrate; just make sure not to use the old data anywhere or nasty things will happen.
  • this size limitation is quite weird indeed; I'd expect 4GB as it's the uint32 limit (or 2GB for int32), but 1GB is too little. It would be interesting to find the reason for these crashes. I only found this vague answer: Max database size boltdb/bolt#535

I suppose it should either be fixed or documented on the main page that lnd explicitly doesn't support 32 bit systems, that it's okay for it to crash as soon as the database reaches 1GB (which isn't that much for a busy node), and that you're left to pick up the pieces. Yes, the mainstream distros like Umbrel and Raspiblitz migrated to aarch64, but the very popular Raspbian (that I use) is still 32 bit by default and the 64 bit version is so unofficial/beta that I had to google it; there's no mention of it on the download page. I didn't even know it existed at all!

@Roasbeef
Member

Roasbeef commented Sep 2, 2021

@renepickhardt there's nothing to resolve on our end, which is why this issue is closed. This is a matter of the architecture that lnd is running on, as well as the kernel-related settings and how those interact with the default database.

The reason why users run into this on 32-bit systems is entirely related to their kernel settings. bbolt uses memory mapping by default to map the entire database into the virtual memory address space. Ignoring everything else, this means that the database size can grow up to 4 GB (not all physically mapped, but simply addressable). Once the DB size gets over 1 GB, bolt attempts to re-size the memory map, doubling it to 2 GB or so. What happens next depends on the memory split between user space and the kernel. If the kernel is set to (as an example) occupy 3 GB of the address space, with 1 GB for user space, then this operation will fail (example for 4 GB of RAM; some Pis have less).

In terms of the kernel, if users only want to run 32-bit systems, then I believe activating the PAE extensions for the "high memory" kernel operating mode can help here. The issue for users that pack everything onto a single Pi (certainly not advised if you want reliability, but again you get what you pay for with a Raspberry Pi) is that other processes (bitcoind, etc.) are also competing for the address space. Depending on swap settings and addressable physical RAM, not everything may be able to fit.
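One way to check how a 32bit kernel is configured in this respect (assuming the kernel exposes its build config; the exact location varies by distro) is to look at the VMSPLIT and HIGHMEM options:

$ zcat /proc/config.gz | grep -E 'VMSPLIT|HIGHMEM'
$ grep -E 'VMSPLIT|HIGHMEM' /boot/config-$(uname -r)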

I wager that only very old or very large nodes running on Pis etc. (that haven't examined their kernel settings at all) need to deal with this. If a node is that large, it would behoove the user to switch to more reliable hardware, as Pis notoriously can run into hardware failure issues.

In practice, if a user ever hits this, it's likely because they have never compacted their database. bolt doesn't actually reclaim the space when things are deleted; instead it keeps it all around on disk, putting the free pages in a free list for new DB operations. Users that run into this can usually just compact their database and the issue goes away. The latest versions of lnd (master, to be made into 0.14) are also better about deleting state they no longer need, which results in smaller database sizes.

Beyond that, lnd 0.14 will ship with a newly optimized etcd backend, as well as initial support for postgres. For users seeking to operate a more reliable setup, both of those options are certainly better than potentially storing all the data on an SD card.

@Roasbeef
Member

Roasbeef commented Sep 2, 2021

To provide a bit more detail as to why the operation can fail (let's assume 4 GB of RAM and that the DB is 2 GB at this point), see this method: https://github.com/etcd-io/bbolt/blob/master/node.go#L517

What happens is that bolt needs to copy the entire database into heap memory temporarily to ensure that once it unmaps and then re-maps the database, the inodes, etc. aren't pointing to stale memory. I think this is the error most people are running into: when the DB needs to double in size initially (and at any 1 GB increment beyond that), there simply isn't enough addressable physical memory.

@Roasbeef
Member

Roasbeef commented Sep 2, 2021

One attempt to mitigate this somewhat in the past was trying to set an initial memory map of the largest DB a 32-bit system can handle: btcsuite/btcwallet#697

This would mean that the DB never needs to be copied over as it would never need to be remapped. IIRC we fell short on testing that, and also generally concluded that in 2021 the effort to make things slightly safer for 32-bit systems may not have been worth it. Since then, IIRC, we entertained removing the compiled 32-bit binaries, but then decided that for smaller nodes (as they should be on a Pi) regular compaction resolves the issue in practice.
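For illustration, pre-sizing the map with bbolt looks roughly like this; a sketch only, where the 1 GiB value and the file path are assumptions rather than what btcsuite/btcwallet#697 actually uses:

package main

import (
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Reserve a large memory map up front so bbolt never has to
	// unmap/remap (and temporarily copy) the file as it grows.
	opts := &bolt.Options{
		InitialMmapSize: 1 << 30, // 1 GiB of address space (assumed value)
	}
	db, err := bolt.Open("channel.db", 0600, opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}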
