Critical: Possible data loss during rebalance if there is any linkfile on the system (gluster-11 version) #4148

rafikc30 · 2023-05-16T21:23:17Z

Description of problem:
The cause of the data loss was the Rebalance daemon mistakenly deleting a data file instead of a linkfile, resulting in the deletion of the entire file from the gluster namespace and causing complete data loss. In addition to the file being deleted, a modification to the data file can also cause data loss.

This data loss can occur in the following situations:

Deletion of the entire file:
This situation arises when there is a linkfile present on the hashed subvol. In this case, the rebalance daemon adds both the linkfile and the data file to the queue. Suppose the data file entry is picked first and migrated successfully. When the linkfile entry is subsequently picked from the queue, the rebalance daemon finds that its hashed and cached subvols are the same, concludes that the entry is still a linkfile, and deletes it. The linkfile check relies on the stbuf populated during the rebalance readdir. By the time the check is performed, however, the file has already been migrated, so the entry being deleted is now the data file.

Incorrect permissions and size:
Similar to the previous case, this occurs when a linkfile and a data file are selected for migration at the same time. A failure can occur at various stages of the rebalance process, and because of this failure a cleanup operation might be performed on the destination where the rebalance is incomplete. If the migration of the first entry has already completed by the time the cleanup runs, the cleanup is performed on an actual data file, resulting in incorrect permissions (e.g., "----------.") or a data file with incomplete information.
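
The first situation is a check-then-act race on stale metadata. The following is a toy model of it in Python, not GlusterFS code: the `Cluster` and `Entry` classes, the subvol names, and the `recheck` flag are all hypothetical and exist only for illustration. It contrasts acting on the type recorded at readdir time with looking the type up again just before deleting. The actual fix merged for this issue may work differently.

```python
from dataclasses import dataclass

LINKFILE = "linkfile"
DATAFILE = "datafile"


@dataclass
class Entry:
    """A queued rebalance entry with the type recorded at readdir time."""
    name: str
    subvol: str
    kind_at_readdir: str


class Cluster:
    """Toy model: each subvol maps a file name to the kind currently on disk."""

    def __init__(self):
        self.subvols = {"hashed": {}, "cached": {}}

    def migrate(self, name, src, dst):
        # Moving the data file replaces whatever sits on dst (here, the linkfile).
        self.subvols[dst][name] = self.subvols[src].pop(name)

    def current_kind(self, name, subvol):
        return self.subvols[subvol].get(name)


def process(cluster, entry, recheck):
    """Handle one queue entry; `recheck` selects a fresh or a stale type check."""
    if entry.kind_at_readdir == DATAFILE:
        cluster.migrate(entry.name, entry.subvol, "hashed")
        return
    # The entry was a linkfile when it was queued.
    if recheck:
        kind = cluster.current_kind(entry.name, entry.subvol)  # fresh lookup
    else:
        kind = entry.kind_at_readdir  # stale information from readdir
    if kind == LINKFILE:
        cluster.subvols[entry.subvol].pop(entry.name, None)


def run(recheck):
    """Queue the data file first, then the linkfile, and report what survives."""
    cluster = Cluster()
    cluster.subvols["hashed"]["f"] = LINKFILE
    cluster.subvols["cached"]["f"] = DATAFILE
    queue = [Entry("f", "cached", DATAFILE), Entry("f", "hashed", LINKFILE)]
    for entry in queue:
        process(cluster, entry, recheck)
    return cluster.current_kind("f", "hashed")
```

In this model, `run(recheck=False)` returns `None` because the migrated data file is removed on the strength of the stale type, while `run(recheck=True)` returns `"datafile"` because the entry is re-examined before deletion.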

The exact command to reproduce the issue:

The full output of the command that failed:

Expected results:

Mandatory info:
- The output of the gluster volume info command:

- The output of the gluster volume status command:

- The output of the gluster volume heal command:

- Provide logs present on the following locations of client and server nodes:
/var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump:

Additional info:
Possible scenarios that leave a linkfile on the system include a rename operation, or the used space on one of the replica sets being above the min-free-disk limit (90% by default).
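
To see whether a brick currently holds linkfiles, one option is to scan the brick directory directly. The sketch below assumes the usual DHT on-disk convention: a linkfile is a regular file whose permission bits are only the sticky bit (shown by `ls` as `---------T`) and which carries the `trusted.glusterfs.dht.linkto` extended attribute. The function names are hypothetical, `os.getxattr` is Linux-only, and reading `trusted.*` attributes normally requires root. Files that are mid-migration can carry unusual modes too, so treat the result as a hint rather than a definitive list.

```python
import os
import stat

# Extended attribute that DHT linkfiles are assumed to carry on the brick.
LINKTO_XATTR = "trusted.glusterfs.dht.linkto"


def has_linkfile_mode(st_mode):
    """True when a regular file's permission bits are only the sticky bit."""
    return stat.S_ISREG(st_mode) and stat.S_IMODE(st_mode) == stat.S_ISVTX


def find_linkfile_candidates(brick_root):
    """Yield paths under a brick directory that look like DHT linkfiles."""
    for dirpath, dirnames, filenames in os.walk(brick_root):
        # Skip the brick's internal .glusterfs metadata directory.
        if ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not has_linkfile_mode(os.lstat(path).st_mode):
                continue
            try:
                os.getxattr(path, LINKTO_XATTR)
            except OSError:
                # Attribute missing or unreadable: not counted as a linkfile.
                continue
            yield path
```

Run as root on a brick host, `list(find_linkfile_candidates("/path/to/brick"))` gives the candidate paths; the brick path here is a placeholder.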

- The operating system / glusterfs version:
gluster11

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

Rebalance daemon is deleting a data file instead of a linkfile, which causes an entire file to be deleted from the gluster namespace, leading to complete data loss. This data loss occurs in the following situations:

1) Entire file is deleted: This happens when there is a linkfile on the hashed subvol. In this case, the rebalance daemon adds both the linkfile and the data file to the queue. Suppose the data file entry is picked up first and is migrated successfully. When the linkfile is then picked up from the queue, the daemon checks that the queued linkfile has its hashed and cached subvols pointing to the same subvol, assumes that it is a linkfile, and deletes it. The linkfile check is performed on the stbuf populated during the rebalance readdir, so by the time the check is done, the file has already been migrated and is a data file.

2) Wrong permissions and incorrect size: Similar to the above case, this happens when a linkfile and a data file are picked up for migration at the same time. A failure can happen at different stages of the rebalance process, and because of this failure a cleanup might be done on the destination where the rebalance is incomplete. By that time the first migration might have completed, so the cleanup is done on an actual file, and we may end up with either wrong permissions like "----------." or a data file with incomplete information.

Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: gluster#4148
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>

Rebalance daemon is deleting a data file instead of a linkfile, which causes an entire file to be deleted from the gluster namespace, leading to complete data loss. This data loss occurs in the following situations:

1) Entire file is deleted: This happens when there is a linkfile on the hashed subvol. In this case, the rebalance daemon adds both the linkfile and the data file to the queue. Suppose the data file entry is picked up first and is migrated successfully. When the linkfile is then picked up from the queue, the daemon checks that the queued linkfile has its hashed and cached subvols pointing to the same subvol, assumes that it is a linkfile, and deletes it. The linkfile check is performed on the stbuf populated during the rebalance readdir, so by the time the check is done, the file has already been migrated and is a data file.

2) Wrong permissions and incorrect size: Similar to the above case, this happens when a linkfile and a data file are picked up for migration at the same time. A failure can happen at different stages of the rebalance process, and because of this failure a cleanup might be done on the destination where the rebalance is incomplete. By that time the first migration might have completed, so the cleanup is done on an actual file, and we may end up with either wrong permissions like "----------." or a data file with incomplete information.

> Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
> Fixes: #4148
> Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
> (Reviewed on #4149)
> (Cherry picked from commit 1cedcc0)

Change-Id: I22f1c9b35811be83bde08c4f30db9d5582d2d2fc
Fixes: #4148
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Co-authored-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>

rafikc30 mentioned this issue May 16, 2023

dht/rebalance: Fix a problem deleting data file #4149

Merged

mohit84 closed this as completed in #4149 May 18, 2023

mohit84 reopened this May 18, 2023

mohit84 mentioned this issue May 18, 2023

dht/rebalance: Fix a problem deleting data file #4155

Merged
