Corrupted large badger repo #5213

ghost · 2018-07-10T15:59:24Z

Version information: 0.4.16-rc2

Type: bug

Description:

I've got this large badgerds repo (~5TB) which I recently updated to 0.4.16-rc2. After the included 6-to-7 repo migration, my repo is corrupted. I'm not sure whether I hard-killed the daemon before the update. The version it was previously running is 8b383da, which was first included in 0.4.15-rc1.

# IPFS_PATH=/ipfs/ipfs_master/repo ./ipfs daemon
Initializing daemon...
Error: Unable to replay value log: "/ipfs/ipfs_master/repo/badgerds/000017.vlog": Data corruption detected. Value log truncate required to run DB. This would result in data loss.
Received interrupt signal, shutting down...
(Hit ctrl-c again to force-shutdown the daemon.)

I tried doing a badger backup with truncation enabled, but that didn't actually go and truncate stuff:

# badger --vlog-dir badgerds/ --dir badgerds/ backup -t
Listening for /debug HTTP requests at port: 8080
Error: Unable to replay value log: "badgerds//000017.vlog": Value log truncate required to run DB. This might result in data loss.
Usage:
  badger backup [flags]

Flags:
  -f, --backup-file string   File to backup to (default "badger.bak")
  -h, --help                 help for backup
  -t, --truncate             Allow value log truncation if required.

Global Flags:
      --dir string        Directory where the LSM tree files are located. (required)
      --vlog-dir string   Directory where the value log files are located, if different from --dir

Unable to replay value log: "badgerds//000017.vlog": Value log truncate required to run DB. This might result in data loss.


schomatis · 2018-07-10T16:17:33Z

Thanks for reporting this; it's my main concern regarding the Badger transition. There are two separate issues:

1. Possible data loss after a kill. This is expected, and by default Badger requires explicit consent to truncate the corrupted values. But what tools and information does an IPFS user have to actually take the required measures (enable truncation) and continue using the repo?
2. The -t flag not working. I'll have to take a closer look at the backup command, and possibly your repo.

ghost · 2018-07-10T16:21:35Z

Let me know an SSH key :)

schomatis · 2018-07-10T16:56:52Z

There's a Windows-related check in the Badger truncation function that caught my attention: it refuses to truncate if the value log has been loaded with mmap. @magik6k, is there an easy way to pass the ValueLogLoadingMode option through the config file so that it doesn't use the (default) mmap?

schomatis · 2018-07-10T17:12:00Z

Actually, if you're running the badger command from the cloned git repo, @lgierth, could you do a temporary modification of the ValueLogLoadingMode default option, change it to FileIO, rebuild and retry the command?

ghost · 2018-07-10T19:22:24Z

It did change something:

Unable to replay value log: "badgerds//000017.vlog": truncate badgerds//000017.vlog: invalid argument

schomatis · 2018-07-10T22:50:43Z

Yes, this is a different problem, I'll raise the corresponding issues at Badger.

manishrjain · 2018-07-11T17:18:20Z

Can you try not passing the slash at the end? So, it doesn't have two slashes in the file path: badgerds//000017.vlog. I'm not sure if that's the issue, but I suspect it might be an issue.

Also, what version of Badger are you on?

P.S. If you have more logs, they would help us understand what's happening here.
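
As a side note on the double-slash suspicion, here is a minimal Go sketch of how such a path can be normalized before use. The helper name vlogPath is hypothetical and is not part of Badger; the sketch only relies on the standard library's path/filepath package, whose Join cleans the result and collapses repeated separators.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// vlogPath is a hypothetical helper (not part of Badger). It joins a
// value-log directory and a file name; filepath.Join cleans the result,
// so a trailing slash on dir does not produce a doubled separator.
func vlogPath(dir, name string) string {
	return filepath.Join(dir, name)
}

func main() {
	// Both calls yield the same cleaned path, with or without the
	// trailing slash on the directory argument.
	fmt.Println(vlogPath("badgerds/", "000017.vlog"))
	fmt.Println(vlogPath("badgerds", "000017.vlog"))
}
```

This only shows that the doubled slash is easy to rule out as a cause; it says nothing about why the truncation itself fails.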

ghost · 2018-07-11T20:45:08Z

That didn't help unfortunately, neither as a relative nor absolute path. Is there anything I could pull out of :8080 while it's still running?

manishrjain · 2018-07-12T11:39:52Z

Can you expand more about the Badger version and the environment? Also, if you have access to the Badger directory, could you tar, gzip and upload it and send me a link? So, I could debug what's going on.

ghost · 2018-07-12T18:32:25Z

> Also, if you have access to the Badger directory, could you tar, gzip and upload it and send me a link?

It's 5 TB unfortunately. Can give you access to the host though.

manishrjain · 2018-07-12T21:28:03Z

Sure. My email id is my first name at dgraph.io. Also, tell me the steps about what to do after logging in.

manishrjain · 2018-07-12T21:33:55Z

It looks like this is the line which is causing the issue. For some reason, it is unable to truncate the file:

https://github.com/dgraph-io/badger/blob/master/value.go#L329

schomatis · 2018-07-12T22:03:21Z

> Can you expand more about the Badger version

The Badger version currently used in go-ipfs is v1.3.0,

https://ipfs.io/ipfs/QmeAEa8FDWAmZJTL6YcM1oEndZ4MyhCr5rTsjYZQui1x1L/badger

although @lgierth was using a much more recent version to run the backup command that was failing, probably v1.5.x.

manishrjain · 2018-07-12T22:07:50Z

What filesystem is the environment using? Is it VFAT or EXT4 or something else?

schomatis · 2018-07-16T21:20:28Z

@lgierth Could you provide @manishrjain more details about the setup?

leerspace · 2018-07-21T22:06:49Z

I've also just encountered this issue on Windows with v0.4.16 on an NTFS partition. The update completed successfully (as far as I could tell) and I was able to use the repo for a while afterwards, but now I'm getting this error. The repo I lost is a lot smaller at 88GB, so I can share if it would be helpful.

schomatis · 2018-07-22T17:45:19Z

Hey @leerspace, there are different errors mentioned in this issue, are you getting the invalid argument one?

leerspace · 2018-07-22T19:32:41Z

@schomatis sorry for not being clearer. I'm getting the error in the first post: Error: Unable to replay value log: "C:\\Users\\user\\.ipfs\\badgerds\\000088.vlog": Data corruption detected. Value log truncate required to run DB. This would result in data loss.

schomatis · 2018-07-22T20:10:16Z

Ok, this may be the consequence of many possible factors, most likely a crash or a hard kill of an ipfs command. It's an acceptable scenario, but at the moment we don't provide a truncate flag or any other tool to bypass this (see #5213 (comment) point 1). I'll open another issue about this.

If you want, you could try badger backup -t to remove the corrupted part of the DB (which should be only a small fraction of it).

leerspace · 2018-07-23T17:39:35Z

@schomatis I just finished running badger --vlog-dir badgerds/ --dir badgerds/ backup -t from the OP and it completed successfully as far as I can tell, but now I get a bunch of disk IO followed by a different error when trying to start the daemon. It's different from what's in this issue, so I can open a separate one for my new error.

manishrjain · 2018-08-17T02:22:59Z

So, Go's truncate function is failing:

2018/08/17 04:18:33 Iterating file id: 16
2018/08/17 04:18:33 Replaying log file: 16. Running count: 2000
2018/08/17 04:18:33 Replaying log file: 16. Running count: 4000
2018/08/17 04:18:33 Iteration took: 214.280716ms
2018/08/17 04:18:33 Iterating file id: 17
panic: offset: 0. Err: truncate ./000017.vlog: invalid argument

goroutine 1 [running]:
github.com/dgraph-io/badger.(*valueLog).iterate(0xc4200f3d48, 0xc4cb72c8a0, 0x0, 0xc5956470e0, 0x1, 0x0)
	/root/go/src/github.com/dgraph-io/badger/value.go:335 +0xa59
github.com/dgraph-io/badger.(*valueLog).Replay(0xc4200f3d48, 0x0, 0xc400000000, 0xc5956470e0, 0x0, 0x0)
	/root/go/src/github.com/dgraph-io/badger/value.go:779 +0x32d

Code:

        if vlog.opt.Truncate && truncate && len(lf.fmap) == 0 {
                // Only truncate if the file isn't mmaped. Otherwise, Windows would puke.
                if err := lf.fd.Truncate(int64(validEndOffset) + 1); err != nil {
                        panic(fmt.Sprintf("offset: %d. Err: %v", validEndOffset, err.Error()))
                        return err
                }

I see that the root folder is on a RAID array. I wonder if that's what's causing the issue. This looks like a problem with either the standard file.Truncate function in Go or with the system itself.

├─sdc3    8:35   0   6.7G  0 part  
│ └─md2   9:2    0    20G  0 raid5 /
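
For comparison, a minimal sketch of the same (*os.File).Truncate call run in isolation against a temporary file. The helper truncateTo and the sample data are made up for illustration and are not Badger code. On a healthy filesystem, with the file opened read-write, the call is expected to succeed, which can help separate a Badger problem from an environment problem.

```go
package main

import (
	"fmt"
	"os"
)

// truncateTo is a hypothetical helper (not Badger code). It writes data
// to a fresh temporary file, truncates the file to keep bytes, and
// returns the resulting file size.
func truncateTo(data []byte, keep int64) (int64, error) {
	f, err := os.CreateTemp("", "vlog-*.tmp")
	if err != nil {
		return 0, err
	}
	// Deferred calls run in reverse order: close first, then remove.
	defer os.Remove(f.Name())
	defer f.Close()

	if _, err := f.Write(data); err != nil {
		return 0, err
	}
	// The same standard-library call that panics in the trace above.
	if err := f.Truncate(keep); err != nil {
		return 0, err
	}
	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	// Keep the first 13 bytes ("valid-entries") and drop the rest.
	size, err := truncateTo([]byte("valid-entries|garbage"), 13)
	if err != nil {
		fmt.Println("truncate failed:", err)
		return
	}
	fmt.Println("size after truncate:", size)
}
```

Running this in the default temporary directory of the affected host, and again on a known-good host, would show whether truncation as such is broken there. It does not reproduce how Badger opens its value log files.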

bonedaddy · 2018-08-17T19:22:06Z

have you tried doing a health check and/or repair of your raid array?

ghost · 2018-08-17T20:27:59Z

Yeeah spot on, one (of four) disks has died without us noticing. I don't even see log lines of when it died. The filesystem seems to be intact and complete, but whatever, let's call this host dead.

The data in the repo can be reproduced relatively easily. (It's really just the cdn.media.ccc.de mirror that needs reproducing.)

manishrjain · 2018-08-20T22:37:46Z

You could copy over this data to another host, and verify that Badger is doing the right thing. Not sure there's anything else we need to do from Badger's end, so I'm considering this issue closed.

schomatis · 2018-08-20T23:43:17Z

> Not sure there's anything else we need to do from Badger's end, so I'm considering this issue closed.

Agreed, I'm closing the issue on the Badger end. Thanks, @manishrjain, for investigating this issue, which wasn't actually related to Badger.

> You could copy over this data to another host, and verify that Badger is doing the right thing.

Could you do this @lgierth to be extra sure? Or is the DB too big to perform a full copy?

ghost · 2018-10-05T02:31:46Z

This has been solved -- the underlying mdadm RAID got into a weird state and might have corrupted/lost data.

ghost added the topic/repo and topic/badger labels Jul 10, 2018

ghost assigned schomatis Jul 10, 2018

This was referenced Jul 10, 2018

Enable truncation in non Windows environments dgraph-io/badger#523

Closed

Truncation error: invalid argument dgraph-io/badger#524

Closed

Stebalien added the need/analysis label Jul 13, 2018

Stebalien mentioned this issue Jul 17, 2018

Make badger-ds the default datastore #4279

Open


schomatis mentioned this issue Jul 22, 2018

Provide a way to truncate a corrupted Badger DB #5275

Closed

This was referenced Jul 23, 2018

badgerds fails to initialize (after a kill?) #4363

Closed

Corrupted badger database - Assert failed #5280

Closed

ghost closed this as completed Oct 5, 2018

manishrjain mentioned this issue Oct 21, 2018

Issue truncating vlog after crash dgraph-io/badger#613

Closed

This issue was closed.
