
Data integrity error after running out of disk space #356

Open
hoxu opened this issue Oct 12, 2015 · 20 comments
hoxu commented Oct 12, 2015

I ran into attic: Error: Data integrity error after a bunch of [Errno 28] No space left on device errors.

The attic repository contains only 15 GiB of data.

Debian 8.2, so attic version is 0.13-1.

Is there any way to recover from this corruption?

@ThomasWaldmann (Contributor)

The error message doesn't contain a lot of information, but likely either a segment file or the repository index got corrupted or left incomplete due to the disk being full.

You could try the following (if the data in that repo is important, make a backup of the repo first):

  • make sure you have enough free space at the repo location
  • run: attic check repo
  • if it finds issues, rerun it: attic check --repair repo

If that did not help, you could retry like this:

  • make sure you have enough free space at the repo location
  • remove index.* and hints.* from the repo directory
  • run: attic check --repair repo

In general, avoid running out of free space (or even getting close to it).
attic needs some space to work, even for "prune" or "delete" operations!
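
Put together (assuming the repository is at "repo", as in the commands above), a cautious recovery attempt could look like this sketch:

# safety copy first, in case --repair makes things worse
cp -a repo repo.backup

# first attempt: check; repair only if the check reports problems
attic check repo || attic check --repair repo

# second attempt, if the first did not help: drop index/hints and rebuild
rm repo/index.* repo/hints.*
attic check --repair repo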

hoxu commented Oct 16, 2015

$ attic list repo/
attic: Error: Data integrity error

$ attic check repo/
Starting repository check...
Error reading segment 4716
attic: Exiting with failure status due to previous errors

Should I go ahead and remove index.* and hints.* and run attic check --repair repo/?

@ThomasWaldmann (Contributor)

Yes, try that.

hoxu commented Oct 17, 2015

# time attic check --repair repo/
attic: Warning: 'check --repair' is an experimental feature that might result
in data loss.

Type "Yes I am sure" if you understand this and want to continue.

Do you want to continue? Yes I am sure
Starting repository check...
Error reading segment 4716
attempting to recover repo/data/0/4716
Repository check complete, no problems found.
Starting archive consistency check...
Analyzing archive 2015-09-05T02:58:41.checkpoint (1/229)
Traceback (most recent call last):
  File "/usr/bin/attic", line 3, in <module>
    main()
  File "/usr/lib/python3/dist-packages/attic/archiver.py", line 715, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/usr/lib/python3/dist-packages/attic/archiver.py", line 705, in run
    return args.func(args)
  File "/usr/lib/python3/dist-packages/attic/archiver.py", line 84, in do_check
    if not args.repo_only and not ArchiveChecker().check(repository, repair=args.repair):
  File "/usr/lib/python3/dist-packages/attic/archive.py", line 494, in check
    self.rebuild_refcounts()
  File "/usr/lib/python3/dist-packages/attic/archive.py", line 643, in rebuild_refcounts
    for item in robust_iterator(archive):
  File "/usr/lib/python3/dist-packages/attic/archive.py", line 622, in robust_iterator
    for item in unpacker:
  File "/usr/lib/python3/dist-packages/attic/archive.py", line 471, in __next__
    return next(self._unpacker)
  File "_unpacker.pyx", line 419, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:419)
  File "_unpacker.pyx", line 348, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:348)
TypeError: unhashable type: 'list'

real    366m34.972s
user    1m31.372s
sys     1m8.100s

@ThomasWaldmann (Contributor)

Hmm, it seems to crash in msgpack (a third-party library used to pack/unpack binary data).

About the archives: you could try deleting all archives named "*.checkpoint"; these are intermediate archives written while a backup is running, and they are superseded once the backup reaches its end.
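
If there are several of them, a loop over the archive list could do it. This is an untested sketch that assumes attic list prints the archive name in the first column:

for name in $(attic list repo | awk '{print $1}' | grep '\.checkpoint$'); do
    attic delete repo::"$name"
done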

Hmm, I see Debian Jessie has msgpack 0.4.2. That might also be an issue (see other issues here).
attic 0.13 in Debian Jessie is also rather old. You may be hitting issues that are fixed in the latest release versions but, unfortunately, still present in the Debian Jessie packages. Using a binary or a pip-based installation might be better here.

A general question: is there important data to recover from this corrupted repo, or could you just start from scratch using more recent code?

hoxu commented Oct 18, 2015

There do not seem to be any *.checkpoint files in the repository. Any other ideas? Or should I try repairing the repo with the newest version of attic?

This repository contains redundant data created with my conversion script rdiff-backup2attic, so I could regenerate it, but I would rather not, because the conversion took over a week. More importantly, if it is possible to run into unrecoverable errors with attic, I don't think I'll dare use it for real :(

It seems the newest version of attic in Debian is 0.13. I think I will file a downstream bug for this as well; what do you think they could do about fixing this issue in Debian Jessie? AFAIK, version upgrades are typically not done in Debian Stable, but fixes can be cherry-picked/backported.

@ThomasWaldmann (Contributor)

*.checkpoint is an archive name (not a file name); see your log output above.

Trying a newer version of attic AND msgpack may help (but that is not certain; to be sure, we would need to point at the changeset that fixed your issue). If a binary release of attic works on your system, you could try that rather easily.

But I am not even sure how exactly it produced the "unhashable type: 'list'" error it is falling over now, or how best to deal with it. I guess that would need a debugging session on a system with the source code and the corrupt data set.

You could also try borgbackup; it sometimes gives better error messages and has more fixes applied than attic, but I am not sure whether your issue is fixed there. Trying to convert a corrupt repository would also be something new and rather adventurous, with an unknown outcome.

hoxu commented Oct 21, 2015

# attic delete repo::2015-09-05T02:58:41.checkpoint
Initializing cache...
Analyzing archive: 2013-06-22T02:16:16
<omitted a lot of lines>
Analyzing archive: 2015-09-05T02:58:41.checkpoint
Traceback (most recent call last):
  File "/usr/bin/attic", line 3, in <module>
    main()
  File "/usr/lib/python3/dist-packages/attic/archiver.py", line 715, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/usr/lib/python3/dist-packages/attic/archiver.py", line 705, in run
    return args.func(args)
  File "/usr/lib/python3/dist-packages/attic/archiver.py", line 221, in do_delete
    cache = Cache(repository, key, manifest)
  File "/usr/lib/python3/dist-packages/attic/cache.py", line 33, in __init__
    self.sync()
  File "/usr/lib/python3/dist-packages/attic/cache.py", line 167, in sync
    for item in unpacker:
  File "_unpacker.pyx", line 419, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:419)
  File "_unpacker.pyx", line 348, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:348)
TypeError: unhashable type: 'list'

I guess trying a newer attic version is up next.

hoxu commented Oct 21, 2015

Attic 0.16:

attic delete repo::2015-09-05T02:58:41.checkpoint
<clip>
Analyzing archive: 2015-09-05T02:58:41.checkpoint
Traceback (most recent call last):
  File "/path/to/virtualenv-attic/bin/attic", line 3, in <module>
    main()
  File "/path/to/virtualenv-attic/lib/python3.4/site-packages/attic/archiver.py", line 730, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/path/to/virtualenv-attic/lib/python3.4/site-packages/attic/archiver.py", line 720, in run
    return args.func(args)
  File "/path/to/virtualenv-attic/lib/python3.4/site-packages/attic/archiver.py", line 233, in do_delete
    cache = Cache(repository, key, manifest)
  File "/path/to/virtualenv-attic/lib/python3.4/site-packages/attic/cache.py", line 60, in __init__
    self.sync()
  File "/path/to/virtualenv-attic/lib/python3.4/site-packages/attic/cache.py", line 216, in sync
    for item in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
TypeError: unhashable type: 'list'

So, attic 0.13 (the version in Debian Jessie) can corrupt the repository in a way that not even the latest attic version can recover from. This seems like a grave issue to me.

@ThomasWaldmann (Contributor)

Well, we can't do much about Debian packaging old releases and then sticking to them; that is just their usual policy for "stable". Sometimes they make exceptions, and this might well be such a case (but the attic developers do not make or maintain these packages, so one would need to take this to the Debian packagers).

But before doing that, it would be really useful to identify the root cause and create a fix (if possible) for this. See above about my offer for a debugging session.

hoxu commented Oct 21, 2015

I'm afraid I can't provide you with access to the data or a debugging session.

But I imagine it shouldn't be difficult to reproduce, and running out of disk space is something that a backup tool should certainly be tested for (and recover from). If it helps: while the repository is 15 GiB, the archives contain very few changes, mostly added data, across 229 increments.

I don't know how attic works internally, but I can imagine it ran out of disk space while writing metadata rather than actual new data. So to reproduce it, I would try to generate lots of increments with very small data changes on an almost-full partition.

hoxu commented Oct 21, 2015

One way to test a near-full disk condition is to use loopback files:

dd if=/dev/zero of=partition bs=256k count=1
mkfs.ext2 -F partition   # -F is needed because "partition" is a regular file, not a block device
mkdir -p attictest
sudo mount -o loop partition attictest
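
From there, one could initialize a repository on the tiny filesystem and write many small increments until it fills up. A rough sketch (file names and counts are illustrative):

attic init attictest/repo
mkdir -p src
for i in $(seq 1 100); do
    echo "increment $i" >> src/file.txt
    attic create attictest/repo::increment-$i src
done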

Quick testing didn't yield the same error, but this is interesting too (attic 0.13):

Traceback (most recent call last):
  File "/usr/bin/attic", line 3, in <module>
    main()
  File "/usr/lib/python3/dist-packages/attic/archiver.py", line 715, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/usr/lib/python3/dist-packages/attic/archiver.py", line 705, in run
    return args.func(args)
  File "/usr/lib/python3/dist-packages/attic/archiver.py", line 129, in do_create
    archive.save()
  File "/usr/lib/python3/dist-packages/attic/archive.py", line 197, in save
    self.repository.commit()
  File "/usr/lib/python3/dist-packages/attic/repository.py", line 127, in commit
    self.compact_segments()
  File "/usr/lib/python3/dist-packages/attic/repository.py", line 185, in compact_segments
    new_segment, offset = self.io.write_put(key, data)
  File "/usr/lib/python3/dist-packages/attic/repository.py", line 556, in write_put
    fd.write(b''.join((crc, header, id, data)))
OSError: [Errno 28] No space left on device

Yet the archive is created. I imagine if errno 28 is raised and attic bails out, it should not create the archive?
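
In shell terms, the invariant I would expect is roughly this (a sketch; the archive name is illustrative):

if ! attic create attictest/repo::should-fail src; then
    # a failed create should leave no trace of the new archive
    attic list attictest/repo | grep -q should-fail && echo "BUG: archive exists despite failed create"
fi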

hoxu commented Oct 21, 2015

FWIW, here's a link to the downstream bug report in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=802619

@ThomasWaldmann (Contributor)

@hoxu look what I found: #87

So, could it be that we have multiple issues here: not just running out of disk space, but maybe also defective memory (or some other hardware-related problem)?

hoxu commented Nov 7, 2015

That's always a possibility, but I find it easier to believe that the handling of running out of free disk space is just lacking (IIRC I saw a lot of errors instead of it bailing out on the first one). And if the Debian version of attic is known to depend on a broken python3-msgpack, I don't know if there's any need to look further.

I think it would be good if attic (or any backup tool, for that matter) had a test suite that emulated various amounts of available disk space, to find out whether the error handling always leaves the repository in a consistent state. Run it on btrfs RAID1 with ECC memory or something, if worried about faulty HW...
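
As a sketch of what such a test could look like, reusing the loopback approach from above (sizes and file contents are illustrative; a real suite would rather fault-inject ENOSPC at the write level):

for kb in 256 512 1024 2048; do
    dd if=/dev/zero of=partition bs=1k count=$kb
    mkfs.ext2 -F partition
    mkdir -p attictest && sudo mount -o loop partition attictest
    mkdir -p src && echo "data $kb" > src/file.txt
    attic init attictest/repo
    attic create attictest/repo::test src || echo "create failed at ${kb}k"
    # whatever happened above, the repository must still be consistent
    attic check attictest/repo || echo "inconsistent repo at ${kb}k"
    sudo umount attictest
done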

@ThomasWaldmann (Contributor)

@hoxu emulating out-of-disk-space in the test suite: that's what I did. I guess it could be used for attic in a similar way. I could NOT reproduce your issue with it, though; that's why I linked to the other ticket, which debugged exactly the same crash down to defective RAM.

borgbackup/borg@22262e3

hoxu commented Nov 11, 2015

@ThomasWaldmann Curious. Did you test attic 0.13 and python3-msgpack 0.4.2?

@ThomasWaldmann (Contributor)

No, I just tested with current borgbackup code.

hoxu commented Nov 16, 2015

Well, to verify whether the problem exists and whether it has already been fixed, it would have to be reproduced with those versions.

How easy would it be to run that test suite against attic 0.13?

@ThomasWaldmann (Contributor)

Not trivial, but possible, I guess. We are using py.test as the test runner, and the package name is "borg" instead of "attic", but a lot of the general project structure is still the same.
