ASSERT at zdb.c:3715:load_concrete_ms_allocatable_trees() #7672

Closed
dioni21 opened this issue Jul 2, 2018 · 9 comments
dioni21 commented Jul 2, 2018

System information

Distribution Name | Fedora
Distribution Version | 27
Linux Kernel | 4.16.16-200.fc27.x86_64
Architecture | x86_64
ZFS Version | zfs-0.7.9-1.fc27.x86_64 (from yum repo)
SPL Version | spl-0.7.9-1.fc27.x86_64

Describe the problem you're observing

I am debugging a problem while copying a whole pool to a new drive using zfs send/recv. Some files on the receiving side have different checksums.

While running zdb -cccv with the installed version, I got a segmentation fault. So I tried a newer version and compiled zdb from the master repo (commit e03a41a). Now I get an assertion failure instead.
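
(For reference, a build from the git tree looks roughly like this — a sketch, not the exact commands used here; 0.7-era sources may also need a matching SPL tree passed to configure via --with-spl:)

git clone https://github.com/zfsonlinux/zfs.git
cd zfs && ./autogen.sh && ./configure && make -j"$(nproc)"
./cmd/zdb/zdb -cccvv tank    # run the in-tree zdb (a libtool wrapper, hence the "lt-zdb" seen in the logs)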

Describe how to reproduce the problem

./zdb -cccvv tank

Traversing all blocks to verify checksums and verify nothing leaked ...

loading concrete vdev 0, metaslab 69 of 145 ...space_map_load(msp->ms_sm, msp->ms_allocatable, maptype) == 0 (0x5 == 0x0)
ASSERT at zdb.c:3715:load_concrete_ms_allocatable_trees()Aborted (core dumped)

Include any warning/errors/backtraces from the system logs

Nothing useful:

Jul  2 13:12:52 nexus systemd[1]: Started Process Core Dump (PID 21018/UID 0).
Jul  2 13:12:52 nexus systemd-coredump[21020]: Resource limits disable core dumping for process 31773 (lt-zdb).
Jul  2 13:12:52 nexus systemd-coredump[21020]: Process 31773 (lt-zdb) of user 0 dumped core.
Jul  2 13:12:52 nexus abrt-dump-journal-core[17302]: Failed to obtain all required information from journald
Jul  2 13:12:52 nexus abrt-dump-journal-core[17302]: Failed to save detect problem data in abrt database

How can I help further?

dioni21 commented Jul 5, 2018

Running zdb -cccvvAAA gets past this point but eventually dies with SIGSEGV.
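
(Each -A relaxes a failure mode in zdb: -A ignores assertion failures, -AA enables panic recovery, and -AAA does both, so a run like the one below can limp past the ASSERT until the later crash:)

./zdb -AAA -cccvv tank    # -A: ignore assertions, -AA: panic recovery, -AAA: both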

No errors from the operating system.

No errors were found after two scrub passes.

rincebrain commented:

@dioni21 Mixing and matching userland and kernel versions is going to produce exciting results. I would suggest that, if you want to try a git version, you purge all traces of the old SPL/ZFS packages and then install the git build.
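
(A rough sketch of what that purge could look like on Fedora — the package names are assumptions and depend on how ZFS was installed:)

dnf remove zfs zfs-dkms libzfs2 libzpool2 spl spl-dkms    # drop the packaged 0.7.9 userland and kernel modules
find /lib/modules -name 'zfs.ko*' -o -name 'spl.ko*'      # check that no stale modules are left behind
cat /sys/module/zfs/version                               # after loading the git build, confirm the running module version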

What do you mean by "have differing checksums"? According to e.g. md5sum, or zdb examining the affected files, or ...?

You haven't included anything about the source pool layout, or the properties on the datasets, or even the arguments for running send|recv.

dioni21 commented Jul 5, 2018

@rincebrain Thanks for your answer.

Using a mixed zdb was a last resort after a SIGSEGV with no further info. I know it is not recommended. Since the matching zdb/kernel setup did not hit this assertion, should I maybe close this issue? Sorry about that...

I think I found the reason for this SEGV (the default inflight I/Os setting); I'll file another issue as soon as I confirm it. Since my disks are SATA, every full-disk operation takes a long time.

"differing checksums" => According to md5sum, or, to be more specific, mtree (yes, I'm a FreeBSD user running Linux)

This is a simple home setup. I am upgrading a 2x4TB pool to a 2x10TB pool, both configured as simple mirrors, and both with log and cache on SSD LVM partitions. There are about 11 datasets, some with dedup, some with compression, some without.

Also, I think the file corruption was caused by using zfs send with the full set of options (--dedup --large-block --replicate --embed --compressed --props). I've seen a previous bug with this configuration, but it is marked as resolved.
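
(In other words, a replication of roughly this shape — the pool, dataset, and snapshot names here are illustrative, not the exact commands used:)

zfs snapshot -r tank@migrate
zfs send --dedup --large-block --replicate --embed --compressed --props tank@migrate | zfs recv -Fu newtank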

rincebrain commented:

@dioni21 Which bug did you see with this config?

The only ones I can think of involving misbehavior with send|recv are #6224 or #4809; the former shouldn't happen if you specify -L, as I understand it, and the latter should be mitigated by the default-on tunable on platforms from 2017 on.

dioni21 commented Jul 5, 2018

@rincebrain If I understood correctly, #6224 does not apply to my setup (0.7.9). That may explain why I got errors even without any zfs send options. I'll try with only --large-block --replicate --props as soon as the current zdb run finishes.

#4809 looks exactly like my problem; I'm not sure if it is the one I read about before. feature@hole_birth is active on both my pools. The source pool is very old, but the destination pool was just created, under 0.7.9. Also, hole_birth handling is currently disabled via module parameters:

/sys/module/zfs/parameters/ignore_hole_birth:1
/sys/module/zfs/parameters/send_holes_without_birth_time:1

Should I worry?

rincebrain commented:

@dioni21 If the source doing the sending has either of those tunables set, #4809 shouldn't happen. What makes you think it's #4809 and not some other kind of data mangling? Have you looked to see whether the affected files are the same every time you send/recv, and how they differ between src and dst using e.g. zdb?
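
(For example, something along these lines — the paths and object numbers are placeholders; on ZFS the inode number reported by ls -i is the object number zdb expects:)

md5sum /tank/data/bigfile /newtank/data/bigfile    # confirm the copies really differ
ls -i /tank/data/bigfile /newtank/data/bigfile     # object numbers (they will differ between the two pools)
zdb -ddddd tank/data <src-object#>                 # dump block pointers for the object on the source
zdb -ddddd newtank/data <dst-object#>              # ...and on the destination, then diff the two dumps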

dioni21 commented Jul 5, 2018

@rincebrain I do not yet know the reason for the corruption. Still searching.

The zdb failure is what started this issue. Right now I can only use mtree/md5sum and/or rsync -c to check file consistency.

What I already know:

  1. A new copy of the same dataset above with all options except --replicate (not entirely sure about that, though) generated a faulty dataset, but with different corrupted files than the previous copy.
  2. A new copy of one dataset (with dedup and compression) using only --props in zfs send completed without errors.
  3. I checked one large file with errors: only the final part of it differs (similar to Silently corrupted file in snapshots after send/receive #4809?). See the comparison sketch below.

Note that I made these new copies without deleting the previous ones, since the new pool has much more free space. Also, the source pool is still in "production" with all my personal stuff, so its contents are changing as we speak...
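
(The comparison sketch mentioned in item 3 — one way to see where the copies diverge; the paths are illustrative:)

cmp -l /tank/data/bigfile /newtank/data/bigfile | head     # offsets of the first differing bytes
cmp -l /tank/data/bigfile /newtank/data/bigfile | wc -l    # how many bytes differ in total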

dioni21 commented Jul 10, 2018

My latest tests lead me to believe the cause of the corruption is zfs send --dedup. Removing it was enough to copy all datasets with no md5sum mismatches in their contents.
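
(i.e. roughly the working variant — the same flags minus --dedup; names are illustrative as before:)

zfs send --large-block --replicate --embed --compressed --props tank@migrate | zfs recv -Fu newtank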

I opened a new issue, #7703

Now that I could copy all data without corruption, I'll try zdb -ccc again ASAP.

dioni21 commented Jul 10, 2018

@rincebrain I'm closing this issue since, as you pointed out, it could have been caused by mixing kernel and userland binary versions. I'll open another one if I can find more details about the SIGSEGV.

Thanks a lot...
