
Data corruption in cluster environment with shared storage on ZOL 0.7.0-rc5 and above #6603

Closed
arturpzol opened this issue Sep 5, 2017 · 10 comments

Comments

@arturpzol

System information

Type Version/Name
Distribution Name Debian Jessie
Distribution Version 8
Linux Kernel 4.4.45, 3.10
Architecture x86_64
ZFS Version 0.7.0-rc5 and above
SPL Version 0.7.1-1

Describe the problem you're observing

I experienced data corruption in a cluster environment (Corosync, Pacemaker) with shared storage after forcing a power-off of one of the cluster nodes (tested on KVM, VMware, and real hardware).

I have one pool:

zpool status
  pool: Pool-0
 state: ONLINE
  scan: none requested
config:

        NAME                                          STATE     READ WRITE CKSUM
        Pool-0                                        ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-0-4  ONLINE       0     0     0
            scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-0-3  ONLINE       0     0     0

with one zvol (primarycache=metadata, sync=always, logbias=throughput) which is shared with a client host.
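
For reference, the zvol was created roughly like this (the volume name, size, and block size below are placeholders, not the exact values from my setup):

    zfs create -V 100G -o volblocksize=128k \
        -o primarycache=metadata -o sync=always -o logbias=throughput \
        Pool-0/vol0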

After forcing a power-off of one of the cluster nodes, the second node takes over the resource and data corruption on the zvol can be observed.

I tested all 0.7.0 RC versions and it seems that a change in 0.7.0-rc5 affected synchronization. After reverting commit 1b7c1e5 the corruption did not occur anymore.

Additionally, I tried different volblocksize values for the zvol and it seems that only volumes with 64k and 128k block sizes have broken synchronization.
If I add a separate ZIL (SLOG) device to the pool, the corruption also does not happen.
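
For completeness, the log device was added along these lines (the device path is only a placeholder):

    zpool add Pool-0 log /dev/disk/by-id/<slog-device>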

I also reported this bug on #3577, but after deeper analysis I think it is a different bug.

@bunder2015
Contributor

bunder2015 commented Sep 5, 2017

Can you pull git head and try again? This might be something fixed by f763c3d.

@behlendorf
Contributor

@arturpzol can you describe the corruption you're able to reproduce?

This could potentially be related to f763c3d, which was addressed. Or, based on the commit you identified, it could be a bug introduced by converting some of the ZIL I/O to be asynchronous. Can you try increasing the zil_slog_bulk module option to a large value, say 1G (zil_slog_bulk=1073741824), and rerunning the test? This will effectively force the first 1G of ZIL writes per txg to be synchronous.
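
For example, something like the following should do it, assuming the parameter is writable at runtime on your build:

    # set for the running module
    echo 1073741824 > /sys/module/zfs/parameters/zil_slog_bulk

    # or persist it across module reloads
    echo "options zfs zil_slog_bulk=1073741824" >> /etc/modprobe.d/zfs.conf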

If you're not already aware of it, I'd also suggest enabling the new multihost feature when running in a failover environment.

     multihost=on|off
             Controls whether a pool activity check should be performed during
             zpool import.  When a pool is determined to be active it cannot
             be imported, even with the -f option.  This property is intended
             to be used in failover configurations where multiple hosts have
             access to a pool on shared storage.  When this property is on,
             periodic writes to storage occur to show the pool is in use.  See
             zfs_multihost_interval in the zfs-module-parameters(5) man page.
             In order to enable this property each host must set a unique
             hostid.  See genhostid(1), zgenhostid(8), and spl-module-parameters(5)
             for additional details.  The default value is off.
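
A rough sketch of enabling it on your pool (each node needs a distinct hostid first):

    # on each node, generate a unique /etc/hostid if one isn't already set
    zgenhostid

    # then turn on the activity check for the pool
    zpool set multihost=on Pool-0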

@behlendorf behlendorf added this to the 0.8.0 milestone Sep 5, 2017
@arturpzol
Author

I tried the git head source, but the corruption still occurred.

I also tried zfs-0.7.1 patched with only f763c3d, but again the corruption occurred.

With zil_slog_bulk=1073741824 set and multihost=on, unfortunately the result is the same.

Environment description:

Two physical or virtual machines share two disks. The zpool is created from the shared disks and uses a single mirror vdev, with two zvols (one with volblocksize=8k, the second with volblocksize=128k).
The cluster is set up using Corosync and Pacemaker. A client machine connects to the storage over iSCSI; SCST 3.0 is used to configure one iSCSI target with two LUNs, each LUN backed by one of the zvols on the pool. The pool and all zvols have the sync property set to always.
The client machine runs Windows and executes the bst5 test against the connected iSCSI storage. SCST is configured to use block IO, with write-through enabled for each LUN.
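
The property settings can be double-checked with something like the following (the zvol names here are placeholders for the two volumes in my setup):

    zfs get sync,logbias,primarycache,volblocksize Pool-0/vol8k Pool-0/vol128k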

Test description:

The test uses bst5 to write data from the Windows OS to the iSCSI-connected storage. During the write I force power off the node which currently has the pool imported. The second node takes over the cluster resource by importing the pool and configuring SCST to share the zvols as iSCSI LUNs. Bst5 is able to continue writing to the LUNs without breaking the test. When the sequential write finishes, I wait until the bst5 tool reads the data back with compare. Bst5 then reports a data mismatch error when it reads back the LUN backed by the zvol with volblocksize=128k. For the LUN backed by the zvol with volblocksize=8k no corruption is reported.

I also eliminated the cluster environment by using a single node which is force rebooted and then automatically imports the pool and configures SCST to share the zvols as iSCSI LUNs, but again the corruption occurred.
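
In the single-node variant the forced reboot was a hard reset; as a sketch, one way to trigger such a reset from the node itself (not necessarily what I used) is via sysrq:

    echo 1 > /proc/sys/kernel/sysrq
    echo b > /proc/sysrq-trigger   # reboot immediately, without syncing or unmounting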

Is it safe to revert 1b7c1e5 and use it with ZOL 0.7.1?

@arturpzol
Author

I set the write tool to save data with a block size of 128k and above on the zvol with volblocksize=128k, so the second condition in the source code:

module/zfs/zvol.c:

if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
        write_state = WR_INDIRECT;
else if (!spa_has_slogs(zilog->zl_spa) &&
    size >= blocksize && blocksize > zvol_immediate_write_sz)
        write_state = WR_INDIRECT;
else if (sync)
        write_state = WR_COPIED;
else
        write_state = WR_NEED_COPY;

is true in my test, so write_state is set to WR_INDIRECT, and I assume that this causes the corruption.


Debug showed:

If volblocksize=128k and writing with block size 128k:

size: 131072 , blocksize: 131072 , zvol_immediate_write_sz: 32768

If volblocksize=128k and writing with block size 256k:

size: 262144 , blocksize: 131072 , zvol_immediate_write_sz: 32768

If volblocksize=128k and writing with block size 4M:

size: 262144 , blocksize: 131072 , zvol_immediate_write_sz: 32768

so each time the second condition is true, and the corruption occurred.


If volblocksize=128k and writing with block size 64k:

size: 65536 , blocksize: 131072 , zvol_immediate_write_sz: 32768

so the third condition is true, and the corruption did not occur.

Should if (sync) be the first condition, or is the second condition broken?

@behlendorf
Contributor

@arturpzol thanks for the additional information. I'm working on reproducing this issue with a simpler test case to debug it further.

@arturpzol
Author

arturpzol commented Sep 7, 2017

After performing the test scenario that caused the corruption with the changes below:

--- zfs/zvol.c  (revision 47737)
+++ zfs/zvol.c  (working copy)
@@ -684,13 +684,13 @@
        if (zil_replaying(zilog, tx))
                return;
 
-       if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
+       if (sync)
+               write_state = WR_COPIED;
+       else if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
                write_state = WR_INDIRECT;
        else if (!spa_has_slogs(zilog->zl_spa) &&
            size >= blocksize && blocksize > zvol_immediate_write_sz)
                write_state = WR_INDIRECT;
-       else if (sync)
-               write_state = WR_COPIED;
        else
                write_state = WR_NEED_COPY;

the issue did not occur anymore for the different volblocksize and write block size combinations.

@behlendorf
Contributor

@arturpzol that's one way to side-step the issue for the moment; it effectively disables WR_INDIRECT log records when sync=always. That will hurt performance, but it seems to avoid the problem until we have a root cause for the issue with indirect log records.

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 8, 2017
The portion of the zvol_replay_write() handler responsible for
replaying indirect log records for some reason never existed.
As a result indirect log records were not being correctly replayed.

This went largely unnoticed since the majority of zvol log records
were of the type WR_COPIED or WR_NEED_COPY prior to OpenZFS 7578.

This patch updates zvol_replay_write() to correctly handle these
log records and adds a new test case which verifies volume replay
to prevent any regression.  The existing test case which verified
replay on filesystem was renamed slog_replay_fs.ksh for clarity.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6603
@behlendorf
Contributor

@arturpzol I've opened #6615, which fixes the root cause of this issue and adds a test case to prevent any regression. I'd appreciate it if you could also verify the fix in your environment. Thanks!
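
If it helps, one way to test it is to fetch the PR branch directly and rebuild; the steps below are just the usual build sequence, so adjust the configure options (e.g. SPL paths) for your environment:

    git clone https://github.com/zfsonlinux/zfs.git && cd zfs
    git fetch origin pull/6615/head:pr-6615 && git checkout pr-6615
    sh autogen.sh && ./configure && make -s -j$(nproc)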

@arturpzol
Author

@behlendorf it seems that the fix works. I performed all tests with different volblocksize and write block size combinations and the corruption did not occur. Thanks for the fix.

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 8, 2017
@behlendorf
Contributor

@arturpzol thanks for reporting this and verifying the fix. We'll get this into the zfs-0.7-release branch for the next point release.

@behlendorf behlendorf removed this from the 0.8.0 milestone Sep 8, 2017
tonyhutter pushed a commit that referenced this issue Sep 13, 2017
Fabian-Gruenbichler pushed a commit to Fabian-Gruenbichler/zfs that referenced this issue Sep 29, 2017