
Shutdown issue with badgerDS - keeps reading from the disk #7283

Open · RubenKelevra opened this issue May 6, 2020 · 10 comments

Labels: exp/intermediate (Prior experience is likely helpful), kind/bug (A bug in existing code, including security flaws), P2 (Medium: Good to have, but can wait until someone steps up), status/accepted (This issue has been accepted), topic/badger, topic/datastore

Comments

@RubenKelevra (Contributor)

Version information:

go-ipfs version: 0.6.0-dev
Repo version: 9
System version: amd64/linux
Golang version: go1.14.2

Commit 591c541

Description:

  • I created a fresh datastore with ipfs init --profile=badgerds.
  • Started the daemon
  • I pinned QmdB8kVBeWvLKyZrvxAAzrVfkLZC3zqcu6o7twLAqUcC67
  • IPFS ran for some hours with no user input

Then I tried to shut down the daemon. Unexpectedly, IPFS started reading from the disk for minutes while nothing was written (according to iotop):

[Screenshot_20200506_212433: iotop output showing sustained disk reads]

The following experimental features were activated in the config at the time:

  • Filestore
  • URLStore
  • QUIC

Datastore config:

"Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "child": {
        "path": "badgerds",
        "syncWrites": false,
        "truncate": true,
        "type": "badgerds"
      },
      "prefix": "badger.datastore",
      "type": "measure"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "1000GB"
  },
  • The IPFS binary had cap_net_bind_service=+ep set to be able to listen on port 443.
  • The environment variable LIBP2P_SWARM_FD_LIMIT was set to 1000.
  • IPFS was called with /usr/bin/ipfs daemon --init --migrate

I fetched the debug data and killed it with SIGABRT to get the stack trace - both are attached.

stacktrace.txt
debug.tar.gz

RubenKelevra added the kind/bug (A bug in existing code, including security flaws) and need/triage (Needs initial labeling and prioritization) labels on May 6, 2020
@Stebalien (Member) commented May 6, 2020

Ah, interesting. Badger is garbage collecting at that point. Or, to be accurate, it's scanning to see if there's anything that needs to be garbage collected.

I've filed dgraph-io/badger#1324. However, for now, we should probably do the same systemd-notify dance on shutdown.
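That dance would look roughly like the sketch below: while the slow Close() is still in flight, keep asking systemd to extend the stop timeout so the daemon isn't SIGKILLed mid-scan. This is only a sketch, assuming a Type=notify unit, systemd >= 236 (which added EXTEND_TIMEOUT_USEC), and the github.com/coreos/go-systemd bindings; closeWithKeepalive and the intervals are made up for illustration.

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/go-systemd/v22/daemon"
)

// closeWithKeepalive (hypothetical helper) runs closeFn and, while it is
// still in flight, repeatedly asks systemd for 30 more seconds of stop
// timeout. SdNotify is a harmless no-op when not running under systemd.
func closeWithKeepalive(closeFn func() error) error {
	done := make(chan error, 1)
	go func() { done <- closeFn() }()

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			// EXTEND_TIMEOUT_USEC is in microseconds: 30000000 = 30s more.
			_, _ = daemon.SdNotify(false, "EXTEND_TIMEOUT_USEC=30000000")
		}
	}
}

func main() {
	// Stand-in for a slow datastore Close(), e.g. badger scanning on shutdown.
	slowClose := func() error { time.Sleep(45 * time.Second); return nil }
	if err := closeWithKeepalive(slowClose); err != nil {
		log.Fatal(err)
	}
}
```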

@RubenKelevra (Contributor, Author)

> Let's

...?

I had to wait a while to get all the data collected for the other bug report, but I guess badger is doing the same thing on startup, since a second startup completes within one second.

@Stebalien (Member)

> ...?

Sorry, dangling edit.

> I had to wait a while to get all the data collected for the other bug report, but I guess badger is doing the same thing on startup, since a second startup completes within one second.

Well, on startup badger may need to clean something if it was killed on shutdown. Otherwise, I'm not sure what it's doing.

@RubenKelevra (Contributor, Author)

> I had to wait a while to get all the data collected for the other bug report, but I guess badger is doing the same thing on startup, since a second startup completes within one second.

> Well, on startup badger may need to clean something if it was killed on shutdown. Otherwise, I'm not sure what it's doing.

Yeah, the stack trace suggests something like that is happening: "valuelog open, valuelog replayLog, valuelog iterate".

But it's strange that the same datastore can be opened within a second if the first opening process is killed.

Maybe there's a detection for an unclean recovery that skips the work on the second attempt?

@Stebalien (Member)

Ah, badger may then recognize that the datastore is corrupted and, instead of trying to fix it, just truncate the unsynced changes (we've configured it to do that because we explicitly call Sync() before/after pinning).
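A rough sketch of that configuration, assuming the badger v1.6-style options API (the path and error handling here are illustrative, not go-ipfs's actual wiring):

```go
package main

import (
	"log"

	"github.com/dgraph-io/badger"
)

func main() {
	// Mirror the issue's Datastore spec: async writes plus Truncate, so a
	// crash just drops the unsynced value-log tail on the next open
	// instead of refusing to start.
	opts := badger.DefaultOptions("/path/to/badgerds").
		WithSyncWrites(false). // "syncWrites": false
		WithTruncate(true)     // "truncate": true

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// ... write pin data here ...

	// Explicit durability point: with SyncWrites off, an explicit Sync()
	// before/after pinning is what makes the pinned data crash-safe.
	if err := db.Sync(); err != nil {
		log.Fatal(err)
	}
}
```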

@RubenKelevra (Contributor, Author)

> Ah, interesting. Badger is garbage collecting at that point. Or, to be accurate, it's scanning to see if there's anything that needs to be garbage collected.

I thought about that again... what's the trigger for this garbage collection in the first place?

Shouldn't we only garbage collect right after the IPFS GC has run (which is not active in this setup)? 🤔

I mean, is there anything badger can clean up at all if we haven't run our own GC?

@Stebalien (Member)

ipfs/go-ds-badger#51
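(That issue discusses when to trigger badger's value-log GC. The usual shape is a periodic loop over RunValueLogGC, roughly like the sketch below; the interval and discard ratio are illustrative, not go-ds-badger's actual values.)

```go
package gcsketch

import (
	"log"
	"time"

	"github.com/dgraph-io/badger"
)

// gcLoop is an illustrative periodic value-log GC loop. RunValueLogGC
// rewrites at most one value-log file per call and returns
// badger.ErrNoRewrite once nothing is worth reclaiming; the scan for
// candidates is what shows up as read-only disk activity.
func gcLoop(db *badger.DB, interval time.Duration) {
	for range time.Tick(interval) {
		for {
			err := db.RunValueLogGC(0.5) // rewrite files that are >= 50% garbage
			if err == badger.ErrNoRewrite {
				break // nothing (more) to collect this round
			}
			if err != nil {
				log.Printf("badger value-log GC: %v", err)
				break
			}
		}
	}
}
```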

@RubenKelevra (Contributor, Author) commented May 7, 2020

@Stebalien what data exactly is stored temporarily or semi-temporarily in the datastore that would accumulate if we didn't run the badger GC? DHT data?

If so, can I avoid having this background GC run if I switch to DHTclient?

Just searching for a temporary solution so my shutdowns don't crash :)

I wrote regarding the badger GC in ipfs/go-ds-badger#54 (comment):

> Maybe we can print a warning on the console when we run a GC event on Badger-DB, to at least inform the user what's going on.
>
> This would make the behavior a bit more transparent.
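Concretely, that could be as small as a log line wrapped around each GC pass (hypothetical wording, building on the loop sketched above):

```go
package gcsketch

import (
	"log"

	"github.com/dgraph-io/badger"
)

// runGCWithWarning sketches the suggested console warning, so users can
// tell why the disk is suddenly busy during a GC pass.
func runGCWithWarning(db *badger.DB) {
	log.Printf("badgerds: running value-log GC; disk reads may spike for a while")
	if err := db.RunValueLogGC(0.5); err != nil && err != badger.ErrNoRewrite {
		log.Printf("badgerds: value-log GC failed: %v", err)
	}
}
```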

@Stebalien (Member) commented May 7, 2020 via email

@RubenKelevra (Contributor, Author) commented May 7, 2020

> DHT data, local provider records, other misc stuff? I'd extend your shutdown timer for now. Also, how much data do you have?

That's the real database, not the test database:

```
[ipfs@vidar ~]$ ipfs repo stat --human
NumObjects: 677400
RepoSize:   154 GB
StorageMax: 1.0 TB
RepoPath:   /home/ipfs/.ipfs
Version:    fs-repo@9
```

But I plan to use a lot more storage on this server for another cluster... like 1-1.5 TB.

I mean, reading up to 2 TB at 2 MB/s isn't going to finish any time soon; that's on the order of a million seconds, or roughly eleven days (if all the data really is read).

And hard-killing the daemon on every security update is no good option either.

Stebalien added the exp/intermediate, P2, status/accepted, topic/badger, and topic/datastore labels and removed the need/triage label on May 22, 2020