ARM 32bit - Data corrupted on import raidz pool on zfs-0.7.4 #6981
Some more info... it seems this problem is not present with a mirror pool or a striped pool:
You could try changing the value of /sys/module/zfs/parameters/zfs_vdev_raidz_impl from "fastest" to "original" on 0.7.4 and see if it still thinks the pool is in a bad way. (Also, I'd be curious to see what its contents are on your system, as I don't have an ARM system with ZFS handy.) I would really hope this doesn't change anything, but it being specific to RAIDZ makes me curious. (Also, what ARM platform specifically? Is this, say, a Raspberry Pi of some flavor, or something else? What does /proc/cpuinfo say?)
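For reference, the parameter can be inspected from userspace; a small guarded read, assuming the standard Linux sysfs layout (the node only exists while the zfs kernel module is loaded):

```python
from pathlib import Path

# Guarded read of the module parameter mentioned above.
param = Path("/sys/module/zfs/parameters/zfs_vdev_raidz_impl")
if param.exists():
    # The active implementation is shown in [brackets],
    # e.g. "cycle [fastest] original scalar ..."
    status = param.read_text().strip()
else:
    status = "zfs module not loaded"
print(status)
```

As root, the implementation can be switched by writing one of the listed names (e.g. "original") to the same path.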
Hi, it's an ARMv7 Banana Pi :) Some results with "original" mode.
Modinfo:
Hm, I couldn't readily reproduce this on a new pool I created on an x64 box with 0.6.5.9, or on a new pool I created on my poor RPi3 with 0.6.5.9 and tried importing on 0.7.4 (both worked fine). You could also try "zpool import -C /tmp/zpool.cache data" to see if it's an issue with zpool.cache having some state that's incorrect, but I would be at least mildly surprised if that were the issue.
In case it's helpful: this pool is very old and has been upgraded at every new release. I don't remember the initial release, but it may have been 0.6.0, so I don't know if that could be related. About your test: do you mean -c (not -C)? How can I create the /tmp/zpool.cache file? I will check whether I have the same issue on amd64.
I ran this test:
It seems I get the same error that I get without using the zpool.cache file.
The experiment I wanted to run was explicitly the opposite of that - I wanted you to tell it to use a cachefile that doesn't exist and/or delete the extant one, not point it at a copy from 0.6.5.9.
I can't use a file that doesn't exist:
@geaaru Try import -c none, then.
@rincebrain: I don't think a "-c none" option exists; it just tries to find a file named "none". In any case, here is the command output:
Hi, I also tested with the latest version, 0.7.5, and I have the same issue on ARM.
Additional information: I attached the same zpool on amd64 with zfs-0.7.4 and I don't receive "FAULTED corrupted data". So it seems to be an issue specific to the 32-bit ARM environment.
By chance, is the ARMv7 Banana Pi a big-endian system?
Output of
I've been running ZFS on this system for at least a couple of years now without any major issue; I only use mirrors, though.
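For the record, endianness can be checked from userspace without consulting vendor docs; Linux on ARMv7 normally runs little-endian, the same as amd64, so byte order alone would not explain a cross-arch difference:

```python
import struct
import sys

# Native byte order of the machine this runs on.
print(sys.byteorder)

# The same four on-disk bytes, decoded under each byte order:
raw = bytes([0x01, 0x00, 0x00, 0x00])
as_little = struct.unpack("<I", raw)[0]  # 1
as_big = struct.unpack(">I", raw)[0]     # 16777216 (0x01000000)
print(as_little, as_big)
```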
On the same storage I have both a mirror and a stripe, and I confirm those work fine. The problem is only with raidz. I'm trying to downgrade to 0.7.1 to narrow the range for a bisect. I will let you know the results ASAP. Thanks to all for the support.
@geaaru Something else that might be interesting: check the contents of /proc/spl/kstat/zfs/fletcher_4_bench, and try the other implementations (by changing /sys/module/zcommon/parameters/zfs_fletcher_4_impl) to see if the corrupted metadata message goes away on 0.7.X.
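For context, all of those implementations must compute the same checksum; they differ only in how the recurrence is vectorized. A sketch of the scalar Fletcher-4 recurrence over little-endian 32-bit words (ZFS keeps the four accumulators as 64-bit values; the masking below models that wraparound explicitly):

```python
def fletcher_4(data: bytes) -> tuple:
    """Scalar Fletcher-4 over little-endian 32-bit words (sketch of the
    'original' implementation; accumulators wrap at 64 bits)."""
    assert len(data) % 4 == 0
    mask64 = (1 << 64) - 1
    a = b = c = d = 0
    for i in range(0, len(data), 4):
        w = int.from_bytes(data[i:i + 4], "little")
        a = (a + w) & mask64
        b = (b + a) & mask64
        c = (c + b) & mask64
        d = (d + c) & mask64
    return (a, b, c, d)

# Words [1, 2] give accumulators (3, 4, 5, 6).
print(fletcher_4((1).to_bytes(4, "little") + (2).to_bytes(4, "little")))
```

If a SIMD variant disagreed with the scalar one on some input, freshly written blocks would fail verification on read, which is why trying each implementation is a useful experiment.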
Hi, I executed these tests:
About your questions: With zfs-0.7.5+kernel-4.9.22:
I remember hitting this in the past. The problem was present on 0.7.0-r3 (see my previous issue #6031). So this means my 0.6.5.9 build has the patch for issue #6031, and the raidz issue could be a regression from 0.7.0-r3 or 0.7.0-r4/r5. Does anyone have an idea where the regression might be, to simplify the bisect?
Some more information... I tested raidz1 backed by plain files:
And in this case it seems to work fine. So I'm not sure where the issue is. It's probably related to some strange state of the pool. It's very strange that this issue is not present on amd64. Could it be related to an issue in a checksum algorithm on 32-bit arch? But then why does a file-backed pool work fine? Thanks in advance for any suggestions. A possible solution seems to be destroying and recreating the pool, but there's a lot of data :'(
@geaaru Yeah, I tried reproducing it with pools created with 0.6.5.11 and 0.7.4 on both my RPi3 and an amd64 VM, and couldn't, so I'm guessing it's something about how old the pool is, and that's...a very large space to search, compared to politely asking you to git bisect. :) |
OK, I'll proceed with rebuilding the pool. However, it's strange that it works fine on 0.6.5.11, and that the same pool works fine with 0.7.4 on amd64. Just for information: between 0.6.x and 0.7.x, were there a lot of changes to the checksum functions? I will close the issue after some more tests with the new pool. Thanks to all for the support.
Well, maybe I found the correct steps to reproduce this issue. After removing all data and creating a new pool (from the amd64 environment):
the issue appears again. So I can confirm the problem is not related to the pool being old. I tried again with these tests:
Does it matter which dataset you do the send/recv of? What do you mean by "After that I execute rollback to snapshot received with "zfs receive" command to enable filesystem."? You shouldn't need a rollback to "enable" a filesystem in any way. What versions of ZFS were you using on the amd64 and ARM machines? What version was the zfs send done on, and with which flags?
Sorry, I will try to clarify my actions. Preface:
When copying all the data (about 500 GB) off the broken pool, to simplify my work instead of running a classic "cp" I used "zfs send" to store the data on another temporary pool (I'm not sure whether this also copied the broken metadata). That pool has no raidz (but uses copies=2). Then, after creating a new pool "data" with three raidz1 vdevs, I ran this command from the temporary pool:
After it completed, I tested on ARM and reproduced the same issue. Just now I tried creating the whole data pool (with three raidz1 vdevs) from amd64 and checked whether it imports correctly on ARM, and the answer is yes:
It seems an empty raidz imports correctly. So now a copy of the same data from the temporary pool is in progress, not with "zfs send" but with a plain "cp", again in the amd64 environment. Tomorrow I will tell you whether I can reproduce the issue with "cp" as well. If yes, the last step is to copy the data directly in the ARM environment.
Hi, I confirm that also with a direct "cp" (on amd64), when I try to import the pool on ARM I get the same issue.
Now I'll try to copy the data directly from ARM, but this is already a clear symptom that something strange happens with raidz1 metadata (or data) created in an amd64 environment and then imported on 32-bit ARM.
Also, whether I copy the data directly with "cp" or use "zfs send + zfs receive" from another pool (without raidz), I get checksum errors:
These errors are not visible with the same disks and data on amd64. On my BPi I currently limit arc_max to this value:
@geaaru So, I just tried to reproduce this again, by making a pool on 0.7.3 amd64 out of three 3-disk raidz1 vdevs, doing a zfs send|recv of a dataset onto it, zfs rollback-ing it, exporting it, then importing it on my RPi with 0.7.4, and no problems. So if you can find a way to reproduce this by making a small pool and sending a small dataset, that'd be quite useful.
@rincebrain Hi, could it be related to some big files? I don't understand why I get these errors while you can't reproduce them. In my case the pool has big video-camera files, some as large as 10 GB. I will try to create a raidz, copy only small files onto it, and see if I can reproduce the problem, to build a test case. Thanks
@rincebrain Maybe I'm on the right road... If I create small files, it works fine:
In this case the import works fine on ARM. Then if I create a big file in the first-level directory, it seems to work:
This script seems to always reproduce my issue:
I'm not sure the problem is related to the big file itself; maybe it's how the file is saved across the different raidz vdevs. Is there a way to analyze how a file is split across the pool? It seems the problem also occurs when running my script directly on ARM: when the script completes correctly, I then run zpool export followed by zpool import, but the import of the pool fails.
Second test:
Some errors appear even before zpool export, with the same dmesg errors:
and then:
Can you check whether the issue is reproducible on your side too?
@ironMann It doesn't matter whether it's set to "original", "fastest", "scalar", "superscalar"; all settings produce this outcome. I even just rechecked it with explicitly setting the module parameter on load in case somehow changing it dynamically wasn't sticking. raidz_test -S segfaulted at ab9f4b0, which is neat. I'm going to rebuild with 0.7.5 and try again. The stacktrace I got is at https://gist.github.com/rincebrain/b811097361db0110df48a3c1b3e670ab. |
@ironMann Testing on my machine, raidz_test -S is fully happy with the state of the universe, as far as I can tell, and yet I can reproduce the issue (see above) even with the "original" raidz implementation.
I am not sure what my kernel stack size is; rummaging around in /proc/config.gz, I don't immediately see any options that would take it away from whatever the default is. |
For an indication of the stack size, check your zfs config.log; there should be a line:
The original reporter was doing this on real disks, I was just using flatfiles because my RPi, unlike his Banana Pi, doesn't have SATA.
Yup.
Already tried, and it still happens even with checksum=sha256 on the resulting dataset as well.
I'll try and report back, but since that commit is from August 2017, and this repros on ab9f4b0, I'm not hopeful. |
@ironMann I am testing on real disk devices (over USB), not layered atop another filesystem. I would expect stack corruption to crash the machine or exhibit other odd failure modes, not reliable corruption of on-disk data. |
So, I'm not an expert on this code, but shouldn't this have caused some kind of error I could see without passing -vvv? https://gist.github.com/rincebrain/c45b7663682c0a26f3e4c98f9c7152e6 e: Nevermind, -T is expected to fail all tests, that's what happens when I stay up too early working on things. |
@nwf @rincebrain |
In thinking about it, the use of
with (refactored by ab9f4b0)
I am deeply confused by the un-commented-upon and seemingly un-merited changes to types made as part of this refactoring; it is suggestive that preserving the original code's semantics exactly was an afterthought, not a design priority. (The structures should simply have been relocated, not mutated, if semantics preservation were a foremost concern.) Since none of these fields are, in fact, sizes of objects in memory (i.e.
I would like to see the corresponding parts of ab9f4b0 reverted entirely (reverting to
As part of the refactoring of ab9f4b0, several uint64_t-s and uint8_t-s were changed to other types. This caused ZoL github issue openzfs#6981, an overflow of a size_t on a 32-bit ARM machine. In absence of any strong motivation for the type changes, this simply puts them back, modulo the changes accumulated for ABD.

Compile-tested on amd64 and run-tested on armhf.

Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Fixes: openzfs#6981
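The failure mode the commit message describes can be shown in isolation. A minimal sketch (the variable names are illustrative, not the actual struct members): a byte offset that fits comfortably in uint64_t silently wraps when narrowed to a 32-bit size_t, so the I/O lands 4 GiB short of where it should.

```python
# Model the two integer widths involved.
UINT64_MASK = (1 << 64) - 1   # uint64_t, as the fields were originally
SIZE32_MASK = (1 << 32) - 1   # size_t on 32-bit ARM, post-refactor

offset = 5 * 2**30            # a column offset 5 GiB into the device

wide = offset & UINT64_MASK   # 5368709120: correct on every arch
narrow = offset & SIZE32_MASK # 1073741824: wrapped, points 4 GiB too low

print(wide, narrow)
```

On amd64, size_t is also 64 bits wide, which is exactly why the same pool imported cleanly there while failing on 32-bit ARM.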
The patch above fixes this issue for me and should cause no badness elsewhere. Assuming that's true, I'd like to propose it for inclusion in the next 0.7 point release as well as on master. @behlendorf, could we get an ARM32 testbot to go along with the existing buildbot? It'd be nice to know things were being looked over automagically. :) |
Yup, we'll get this fix into 0.7.6 and master. As for adding a buildbot for ARM32, I'll see what can be done; it certainly would be nice.
As part of the refactoring of ab9f4b0, several uint64_t-s and uint8_t-s were changed to other types. This caused ZoL github issue #6981, an overflow of a size_t on a 32-bit ARM machine. In absence of any strong motivation for the type changes, this simply puts them back, modulo the changes accumulated for ABD.

Compile-tested on amd64 and run-tested on armhf.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Closes #6981
Closes #7023
Hi, after upgrading to zfs-0.7.4 I can't import a previously created zfs pool with raidz.
System information
Describe the problem you're observing
Describe how to reproduce the problem
After rebooting into the previous kernel 4.9.22 + zfs-0.6.5.9, the pool is not corrupted.